Designers of statistical machine translation (SMT) systems have begun to employ tree-structured
translation models. Systems involving tree-structured translation models tend to be complex.
This article aims to reduce the conceptual complexity of such systems, in order to make them
easier to design, implement, debug, use, study, understand, explain, modify, and improve. In
service of this goal, the article extends the theory of semiring parsing to arrive at a novel abstract
parsing algorithm with five functional parameters: a logic, a grammar, a semiring, a search
strategy, and a termination condition. The article then shows that all the common algorithms that
revolve around tree-structured translation models, including hierarchical alignment, inference
for parameter estimation, translation, and structured evaluation, can be derived by generalizing
two of these parameters: the grammar and the logic. The article culminates with a recipe for
using such generalized parsers to train, apply, and evaluate an SMT system that is driven by
tree-structured translation models.



1. Introduction

Today's best statistical machine translation (SMT) systems are driven by translation
models that are weighted finite-state transducers (WFSTs) (Och and Ney, 2002; Kumar and Byrne, 2003).
Figure 1 shows a typical example of a WFST translation model, and the way it is composed
of a series of sub-transducers. Models of this type and our methods for using
them have become increasingly sophisticated in recent years, leading to steady advances
in the accuracy of the best MT systems. However, such translation models run
counter to our intuitions about how expressions in different languages are related. In
the short term, SMT research based on WFSTs may be a necessary stepping stone, and it
is still possible to make improvements by hill-climbing on objective criteria. In the long
term, the price of implausible models is reduced insight. There is a growing awareness
in the SMT research community that major advances can come only from deeper
intuitions about the relationship of our models to the phenomena being modeled.

From an engineering point of view, modeling translational equivalence using WFSTs
is like approximating a high-order polynomial with line segments. Given enough parameters,
the approximation can be arbitrarily good. In practice, the number of parameters
that can be reliably estimated is limited either by the amount of available training
data or by the available computing resources. Suitable training data will always be limited
for most of the world's languages. On the other hand, for resource-rich language
pairs where the amount of training data is practically infinite, the limiting factor is the
number of model parameters that fit into our computers' memories. Either way, the
relatively low expressive power of WFSTs limits the quality of SMT systems.



* A preliminary version of this article was published by Melamed (2004b). This article improves and
extends every section of that preliminary version, and adds several new sections.

         Maria no daba una bofetada a la bruja verde
GLOSS:   Mary no give a slap to the witch green
SWAP:    Mary no give a slap to the green witch
DELETE:  Mary no give a slap the green witch
REPLACE: Mary not give a slap the green witch
INSERT:  Mary did not give a slap the green witch
MERGE:   Mary did not slap the green witch

Figure 1
Translation by finite-state transduction. Adapted from Knight and Koehn (2003).



To advance the state of the art, SMT system designers have begun to experiment
with tree-structured translation models (e.g., Wu, 1996; Alshawi, 1996; Yamada and Knight, 2002;
Gildea, 2003; Chiang, 2005). Tree-structured translation models have the potential
to encode more information using fewer parameters. For example, suppose we wish
to express the preference for adjectives to follow nouns in language L1 and to precede
them in language L2. A model that knows about parts of speech needs only one parameter
to record this binary preference. Some finite-state translation models can encode
parts of speech and other word classes (Och, Tillmann, and Ney, 1999). However,
they cannot encode the preferred relative order of noun phrases and adjectival phrases,
because this kind of knowledge involves parameters over recursive structures. To encode
such knowledge, a model must be at least tree-structured. For example, a syntax-directed
transduction schema (SDTS) (Aho and Ullman, 1969) needs only one parameter
to know that an English noun phrase of the form (Det AdjP N) such as "the green
and blue shirt" translates to Arabic in the order (Det N AdjP). A well-known principle
of machine learning is that, everything else being equal, models with fewer parameters
are more likely to make accurate predictions on previously unseen data.

Several authors have added tree-structured information to systems that were primarily
based on WFSTs (Koehn, Och, and Marcu, 2003; Eng et al., 2003). Such a system
can be easier to build, especially given pre-existing software for WFST-based SMT.
However, such a system cannot reach the potential parameter efficiency of a tree-structured
translation model, because it is still saddled with the large number of parameters required
by the underlying WFSTs. Such hybrid systems are improving all the time. Yet,
one cannot help but wonder how much faster they would improve if they were to shed
their historical baggage.

To realize the full potential of tree-structured models, an SMT system must use them 
as the primary models in every stage of its operation, including training, application to 
new inputs, and evaluation. Switching to a less efficient model at any stage can result 
in an explosion in the number of parameters necessary to encode the same information. 
If the resulting model no longer fits in memory, then the system is forced to lose information,
and thus also accuracy.¹ Even when memory is not an issue, the increased
number of parameters risks an increase in generalization error.

¹ An alternative is to swap the model out to secondary storage, slowing down the system by several
orders of magnitude.

For these reasons, among others, the SMT research community is highly motivated
to build systems whose every process is driven primarily by tree-structured models.
Unfortunately, from a naive point of view, such systems tend to be conceptually complex,
much more complex than WFST-based systems. Complex systems present significant
obstacles to research:

• They take a long time to design, to implement, to debug and to document. 
Given the fast pace of our field of research, good software engineering is usually 
postponed until after the next conference paper deadline. So, typical implemen- 
tations are difficult to modify and to extend even for their authors, let alone for 
anybody else. 

• The large number of possible variations in the algorithms involved and in their
parameters makes it difficult to run controlled experiments. The large number
of independent variables makes it difficult to assign credit/blame for changes in
the system's accuracy. Improvements are typically obtained by trial and error,
and followed by post-hoc explanations that may or may not be scientifically
valid.

• The large number of variables that can affect the outcome of an experiment
makes the experiments difficult to describe in detail. Experiments that are not
fully described are difficult to replicate. Perhaps this is why most of the literature
to date on tree-structured translation models compares those models
only to variations of themselves and to WFST-based models, but not to other
tree-structured models in the literature.

Despite the fast pace of research in this area, it is likely that research would progress 
more quickly if it were not hindered by the above obstacles. 

The primary aim of this article is to reduce the conceptual complexity of SMT systems
driven by tree-structured translation models, and thereby to reduce the obstacles
outlined above. In service of this goal, Section 2 extends the theory of semiring parsing
to arrive at a novel analysis of many common parsing algorithms. This analysis led to
two insights, which are expounded in the sections that follow. First, under a certain
parameterization, all of the non-trivial algorithms that are necessary for this approach to
SMT are special cases of just one algorithm. Second, the one key algorithm that is necessary
for this type of SMT is a direct generalization of ordinary parsing. These insights
imply that:

• Implementation of an SMT system driven by tree-structured translation mod- 
els requires only one non-trivial software module. The software engineering ef- 
fort of the implementation, as well as of any subsequent extensions, is thereby 
reduced by an order of magnitude. This reduction in effort makes the enter- 
prise feasible for a much larger number of researchers. The "Statistical Machine 
Translation by Parsing" Team at the 2005 JHU Language Engineering Workshop 
took advantage of this new-found feasibility to build the first publicly available 
toolkit for machine translation by parsing.²

• An innovation or improvement in one algorithm will often be applicable to 
all the others. Conversely, a deeper understanding of the relationships among 
these algorithms can lead to new insights about the whole class. 

• Many of the problems that SMT research will encounter can be solved by gen- 
eralizing existing solutions from the parsing literature. Such generalizations 
typically require less effort than completely new solutions, as this article shall 
demonstrate. 

² See http://www.clsp.jhu.edu/ws2005/groups/statistical/GenPar.html.

The article makes no empirical claims about the merits of tree-structured translation
models or translation by parsing. Instead, the aim is to reduce the effort necessary for 
research into what those merits might be. 

2. Anatomy of a Parser 

In natural language processing, a parser is an algorithm for inferring linguistic structure.
We limit our attention to parsers that infer structure incrementally using a grammar
(Jurafsky and Martin, 2000), rather than by reranking a list of pre-existing structures
(Collins and Koo, 2005), or by inferring an entire parse tree as a point in a high-dimensional
feature space (Taskar et al., 2004).³ However, the grammar need not be
generative or probabilistic. Our only requirement for the grammar is that it should
assign values to parts of parse tree structures. These values can range over booleans
(structure is possible or not), probabilities, feature weights (Chiang, 2005), or other values
such as confidence estimates (Turian and Melamed, 2005).

To facilitate our generalization of ordinary parsers to algorithms necessary for SMT,
we shall recast them in terms of an abstract parsing algorithm with five functional parameters:
a grammar, a logic, a semiring, a search strategy, and a termination condition.
We shall then express all the algorithms necessary for SMT by generalizing two of those
parameters: the grammar and the logic. This characterization will make it easier to
compare and contrast these algorithms. The use of logics to describe parsers is not new
(e.g., Shieber, Schabes, and Pereira (1995) and references therein). Klein and Manning (2003)
have compared different search strategies for a fixed parsing logic and grammar. The
parameterization of parsing algorithms by semirings was studied by Goodman (1998),
who also presented an abstract parsing algorithm. The abstract parsing algorithm in
Section 2.5 is more detailed and more general. Before presenting this algorithm, we
shall explain some of its parameters. We presume that readers are already familiar with
probabilistic context-free grammars and ordinary parsers (Jurafsky and Martin, 2000).

2.1 Logics for Parsing 

A parser's logic determines the parser's possible states and transitions. The specifica- 
tion of a parsing logic has three parts: 

• a set of term type signatures, 

• a set of inference rule type signatures, 

• a set of axiom terms. 

Terms are the building blocks of inference rules. Items are terms that represent par- 
tial parses. Terms that represent grammatical constraints such as production rules are 
sometimes called grammar terms. When the parser runs, the term and inference rule 
types are instantiated and their variables are assigned values. The state of a parser can 
be uniquely specified by the values of all possible terms. 

In the parser's initial state, all terms have a particular default value, such as "false"
or "zero probability," depending on the semiring (see below). Axioms are term instances
(not types) that are assigned non-default values during the parser's initialization
procedure. The most common kinds of axioms come from the grammar and
from the input. Typically, each input word gives rise to an axiom. If the grammar involves
production rules, then each production rule becomes an axiom too. As a parser
runs, it can change the values of terms from their initial value to some other value.
Melamed (2004b) presented a different formulation, where the values of terms are initially
unknown. The present formulation is cleaner because it obviates the need for term
values that are not semiring values (see Section 2.3).

³ At the time of writing, parsing by structured classification is too expensive to train for practical
purposes, and reranking approaches rely on the kind of parsers that we focus on.

Table 1
Logic D1C. w_i are input words; X, Y and Z are nonterminal labels; t is a terminal; i and j are
word boundary positions; n is the length of the input.

Term Types
  terminal items                ⟨i, t⟩
  nonterminal items             [X; (i, j)]
  terminating productions       X → t
  nonterminating productions    X → Y Z

Axioms
  input words                   ⟨i, w_i⟩ for 1 ≤ i ≤ n
  grammar terms                 as given by the grammar

Inference Rule Types
  Scan                          ⟨i, t⟩ , X → t
                                ----------------
                                [X; (i−1, i)]

  Compose                       [Y; (i, j)] , [Z; (j, k)] , X → Y Z
                                ------------------------------------
                                [X; (i, k)]

Inference rules describe the parser's possible transitions from one state to another.
We shall express inference rules as sequents: $\frac{x_1, \ldots, x_k}{y}$ means that the value of the
consequent $y$ depends on the values of the antecedents $x_1, \ldots, x_k$. For example, if we
are dealing with probabilities, then the probability of the consequent might be defined
as the product of the probabilities of the antecedents. The exact relationship between
these values depends on the semiring, explained below. Every change in term value
corresponds to the invocation of an inference rule where that term is a consequent.

For example, consider Logic D1C, shown in Table 1. This is a logic for parsing
under context-free grammars (CFGs) in Chomsky Normal Form (CNF). This logic has
four term types. Two term types represent production rules in the grammar. The other
two term types are items. Each of the logic's terminal items relates a terminal symbol to
a word position. Each of the logic's nonterminal items relates a nonterminal symbol to
a span. Each span consists of boundaries i and j, which range over positions between
and around the words in the input. The position to the left of the first word is zero, and
the position to the right of the jth word is j. Thus, 0 ≤ i < j ≤ n, where n is the length
of the input.

Parser D1C is any inference procedure based on Logic D1C. For every run, Parser D1C
is initialized with axioms that represent its grammar's production rules and the words
in its input. It then commences to fire inferences. A Scan inference can fire for the ith
word w_i if that word appears on the right-hand side (RHS) of a terminating production
in the grammar. If a word appears on the RHS of multiple productions (with different
left-hand sides), then multiple Scan inferences can fire for that word. The span of
each item inferred by a Scan inference always has the form (i − 1, i), because such items
always span one word, so the distance between the item's boundaries is always one.

Parser D1C spends most of its time composing pairs of items into larger items. It
can Compose two items whenever they satisfy both of the following constraints: 

• Immediate Dominance (ID). Their labels must match the nonterminals on the RHS 
of a nonterminating production rule in the grammar. 

• Linear Precedence (LP). The order of the items' spans over the input must match 
the order of their labels in the antecedent production rule. 

If two spans overlap, then their order is undefined, and they cannot Compose. Thus, the 
LP constraint ensures that no part of the input is covered more than once, so that every 
partial parse is a tree (rather than a more general graph). The reason items store their
spans is to help the parser to enforce the LP constraint.
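The Scan and Compose rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, and the functions `scan` and `compose` are hypothetical and do not come from the article, and terms are modeled as plain named tuples. The Compose check assumes that the two antecedent spans share a boundary, as in Table 1.

```python
from typing import NamedTuple, Optional

class TerminalItem(NamedTuple):
    """Terminal item <i, t>: terminal t is the i-th input word (1-based)."""
    i: int
    t: str

class NonterminalItem(NamedTuple):
    """Nonterminal item [X; (i, j)]: label X covers the span from boundary i to j."""
    label: str
    i: int
    j: int

class TerminatingProduction(NamedTuple):
    """Grammar term X -> t."""
    lhs: str
    t: str

class NonterminatingProduction(NamedTuple):
    """Grammar term X -> Y Z."""
    lhs: str
    left: str
    right: str

def scan(item: TerminalItem, rule: TerminatingProduction) -> Optional[NonterminalItem]:
    """Scan: from <i, t> and X -> t, infer [X; (i-1, i)]."""
    if item.t != rule.t:
        return None
    return NonterminalItem(rule.lhs, item.i - 1, item.i)

def compose(a: NonterminalItem, b: NonterminalItem,
            rule: NonterminatingProduction) -> Optional[NonterminalItem]:
    """Compose: from [Y; (i, j)], [Z; (j, k)] and X -> Y Z, infer [X; (i, k)]."""
    id_ok = (a.label, b.label) == (rule.left, rule.right)  # Immediate Dominance
    lp_ok = a.j == b.i                                     # Linear Precedence
    if not (id_ok and lp_ok):
        return None
    return NonterminalItem(rule.lhs, a.i, b.j)

# Toy run over the two-word input "the witch".
det = scan(TerminalItem(1, "the"), TerminatingProduction("Det", "the"))
noun = scan(TerminalItem(2, "witch"), TerminatingProduction("N", "witch"))
rule = NonterminatingProduction("NP", "Det", "N")
np = compose(det, noun, rule)
```

Swapping the two items in the call to `compose` violates both constraints, so no item is inferred in that case.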

Some previous publications included a goal item in the logic specification. For example,
Goodman (1998)'s parsing logics specify the goal of finding a constituent that
covers the input text and has the grammar's start symbol as its label. More generally,
however, goal items can vary independently of the logic. For example, we might want
to use Logic D1C to find all the noun phrases in the input, rather than a single parse for
the whole sentence. For this reason, our logics do not specify goals.

2.2 Search Strategies

A parsing logic specifies how terms can be inferred, but it does not specify the order of
inferences. When a parser needs an inference to evaluate, it consults its search strategy.
Goodman (1998) required one particular search strategy for his abstract parsing algorithm,
which depended on a topological sort of all possible terms. Melamed (2004b)
used inference rules to specify a partial order on the computations of term values, although
he allowed the order to be determined on the fly. Here we leave all ordering
decisions to the search strategy, which may or may not consult the logic. A variety of
search strategies are in common use. For example, the CKY algorithm (Kasami, 1965;
Younger, 1967) always infers smaller items before larger ones. Alternatively, given term
costs such as negative log-probabilities, we can run the parser as a uniform-cost search,
inferring less costly consequents before more costly ones. If we are interested in just the
single best parse or the n-best parses, then A* strategies of varying sophistication can
be employed to speed up the search (Klein and Manning, 2003). The benefit of a separate
search strategy is the usual benefit of abstraction: analyses of logics unobscured
by search strategies are applicable to a larger class of algorithms, as we shall show in
Section 5.
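As a rough illustration of how two of the orderings mentioned above differ, the following Python sketch sorts the same hypothetical items with a priority queue under a CKY-style priority (span width) and a uniform-cost priority (negative log-probability). The item encoding `(label, i, j, cost)` and all function names are assumptions made for this example. A real search strategy would interleave selection with inference rather than order a fixed list.

```python
import heapq
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical item encoding for this sketch: (label, i, j, cost),
# where cost stands for a negative log-probability.
Item = Tuple[str, int, int, float]

def ordered(items: Iterable[Item], priority: Callable[[Item], float]) -> Iterator[Item]:
    """Yield items in increasing priority, as an agenda would pop them."""
    heap = [(priority(it), n, it) for n, it in enumerate(items)]  # n breaks ties
    heapq.heapify(heap)
    while heap:
        _, _, it = heapq.heappop(heap)
        yield it

def cky_priority(item: Item) -> float:
    """CKY-style order: smaller spans before larger ones."""
    _, i, j, _ = item
    return j - i

def uniform_cost_priority(item: Item) -> float:
    """Uniform-cost order: less costly items before more costly ones."""
    return item[3]

items = [("S", 0, 2, 1.5), ("N", 1, 2, 2.0), ("Det", 0, 1, 0.5)]
cky_order = [it[0] for it in ordered(items, cky_priority)]
ucs_order = [it[0] for it in ordered(items, uniform_cost_priority)]
```

Because only the priority function changes between the two runs, the logic that produced the items is untouched, which is the separation of concerns that the paragraph above argues for.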

2.3 Semirings for Parsing 

A semiring consists of a set, binary operators ⊕ and ⊗ over that set, and an identity
element in the set for each of the two operators. For example, we can define a semiring
over the set of integers, where ⊕ and ⊗ are the usual addition and multiplication operators,
and the identity elements are 0 and 1. A semiring's set need not consist of numbers
and its operators need not be arithmetic.
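One possible way to represent this definition in code is a small record holding the two operators and their identity elements. The sketch below is an assumption of this example, not the article's implementation: the class name `Semiring` and its field names are hypothetical. The second instance, sets of strings under union and pairwise concatenation, is a standard example of a semiring whose set does not consist of numbers; it does not appear in the article.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Semiring(Generic[T]):
    plus: Callable[[T, T], T]   # the additive operator
    times: Callable[[T, T], T]  # the multiplicative operator
    zero: T                     # identity element for plus
    one: T                      # identity element for times

# The integer example from the text: ordinary addition and multiplication.
INTEGER = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b, zero=0, one=1)

# A non-numeric semiring: sets of strings under union and concatenation.
LANGUAGE = Semiring(
    plus=lambda a, b: a | b,
    times=lambda a, b: frozenset(x + y for x in a for y in b),
    zero=frozenset(),
    one=frozenset({""}),
)

ab = frozenset({"a", "b"})
c = frozenset({"c"})
concatenated = LANGUAGE.times(ab, c)
```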

Of particular relevance here are semirings that have been proposed specifically for
the purpose of describing parsing algorithms in a compact way. Parsing semirings interact
with parsing logics according to the following equation:

$$V(y) = \bigoplus_{\substack{x_1, \ldots, x_k \\ \text{such that } \frac{x_1, \ldots, x_k}{y}}} \; \bigotimes_{i=1}^{k} V(x_i) \qquad (1)$$

In this equation, x and y range over terms, and V(·) is a function that maps terms to
semiring values (i.e. elements of the semiring's set). The equation says that the semiring
value of any term is a sum, over all inferences where that term is a consequent, of the
product of the values of the antecedents of the inference. The definitions of sum and
product here depend on the semiring. Some examples will help to make these abstract
ideas more concrete.

The Boolean semiring over the set {TRUE, FALSE} defines ⊕ as disjunction and ⊗ as
conjunction. Under this semiring, the default term value is FALSE. A term can become
TRUE in one of only two ways:

1. A term is TRUE if it is an axiom. This is usually the case for grammar terms and
items representing input words.

2. According to Equation 1, a term is TRUE if it is the consequent of some inference
rule where all the antecedents are TRUE.

If neither of the above conditions holds, then the term retains its default FALSE value.
Starting from the parser's initial state, we can run the parser under a Boolean semiring
to determine the truth value of the item that spans the input and has the grammar's
start symbol as its label. A TRUE value indicates that the grammar can generate that
input.

If the grammar guiding the parser is probabilistic, then it's possible to use an
inside-probability semiring, where ⊕ is real addition and ⊗ is real multiplication. Under this
semiring, the grammar assigns probabilities to the grammar terms. We can run the 
parser under the inside-probability semiring to compute the total probability of any 
item. The probability of the item that spans the input and has the grammar's start 
symbol as its label is the probability of the grammar generating the input. 

Goodman (1998) studied the above semirings and a variety of other semirings that
are useful for parsing, including: 

• the Viterbi semiring for computing the probability of the single most probable 
derivation; 

• the Viterbi-derivation semiring for computing the single most probable derivation;⁴

• the Viterbi-n-best semiring for computing the n most probable derivations; 

• the derivation-forest semiring for computing all possible derivations; 

• the counting semiring for computing the number of possible derivations. 

The probabilistic semirings can be straightforwardly extended to unnormalized weights. 
The expectation semiring (Eisner, 2002) can be used to compute expected probabilities,
as well as expected feature counts for maximum entropy models and derivatives for 
gradient-based optimization methods. All of these parsing semirings apply equally 
well to all the classes of algorithms that we discuss in this article. 
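To make the role of the semiring concrete, the following Python sketch evaluates Equation 1 for a single consequent under four of the semirings discussed above: Boolean, inside-probability, Viterbi, and counting. Everything in it is an assumption made for illustration: the function name `term_value`, the string encoding of terms, and the numeric values, which are not taken from any grammar in the article.

```python
from functools import reduce
import operator

def term_value(antecedent_sets, values, plus, times, zero, one):
    """Equation 1: semiring sum, over inferences, of the product of antecedent values."""
    total = zero
    for antecedents in antecedent_sets:
        total = plus(total, reduce(times, (values[x] for x in antecedents), one))
    return total

# A hypothetical item that can be inferred in two ways.
inferences = [("Det(0,1)", "N(1,2)", "NP -> Det N"),
              ("Adj(0,1)", "N(1,2)", "NP -> Adj N")]

truth = {"Det(0,1)": True, "Adj(0,1)": False, "N(1,2)": True,
         "NP -> Det N": True, "NP -> Adj N": True}
prob = {"Det(0,1)": 0.5, "Adj(0,1)": 0.25, "N(1,2)": 0.5,
        "NP -> Det N": 0.5, "NP -> Adj N": 0.25}
count = {"Det(0,1)": 1, "Adj(0,1)": 1, "N(1,2)": 2,
         "NP -> Det N": 1, "NP -> Adj N": 1}

derivable = term_value(inferences, truth, operator.or_, operator.and_, False, True)
inside = term_value(inferences, prob, operator.add, operator.mul, 0.0, 1.0)
viterbi = term_value(inferences, prob, max, operator.mul, 0.0, 1.0)
n_derivations = term_value(inferences, count, operator.add, operator.mul, 0, 1)
```

Only the operators and identity elements change between the four calls; the inferences stay the same. That is the sense in which the semiring is an independent parameter of the parser.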

⁴ Or the set of most probable derivations, if there are ties.

2.4 Termination Conditions

Different termination conditions are appropriate for different parsing applications. Most 
applications involve a goal item, such as an item that spans the input and is labeled 
with the start symbol of the grammar. Then the termination condition is that no further 
inference can change the value of the goal item. Some applications, such as those in
Sections 7.2 and 8, involve multiple goal items. There the termination condition must
hold conjunctively for all goal items.

In practice, termination conditions often cannot be expressed solely in terms of goal
items and their values. For example, Earley parsing logic (Goodman, 1998, Section 2.1.1)
might be used to compute the probability of a string under the inside semiring and
a PCFG that is not in CNF. If the PCFG has cycles of unary productions (like {A → B,
B → A}), then the parser will not terminate, because it will be computing an infinite
sum. There are methods for computing such sums in closed form (Stolcke, 1995), and it
is possible to augment the parser with those methods. However, most parsing applications
resort to approximations, because such approximations are easier to implement.
A typical implementation limits the computing resources that a run of the parser can 
expend. So, in addition to goal items, the termination condition might test the elapsed 
time, the size of allocated memory, or the number of inferences fired, possibly as a func- 
tion of the input size. 
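A resource-limited termination condition of the kind just described might look like the following Python sketch. The class name `BudgetedTermination`, its fields, and the cubic inference budget in the usage lines are all hypothetical choices made for this example. The caller is assumed to know whether any further inference could still change the goal item's value.

```python
import time
from dataclasses import dataclass, field

@dataclass
class BudgetedTermination:
    """Stop when the goal's value is final or a resource budget is spent."""
    max_inferences: int
    max_seconds: float
    start: float = field(default_factory=time.monotonic)
    inferences_fired: int = 0

    def record_inference(self) -> None:
        """Called by the parser each time an inference fires."""
        self.inferences_fired += 1

    def satisfied(self, goal_can_still_change: bool) -> bool:
        if not goal_can_still_change:          # ordinary goal-based termination
            return True
        if self.inferences_fired >= self.max_inferences:
            return True                        # inference budget exhausted
        return time.monotonic() - self.start >= self.max_seconds

# Illustrative budget expressed as a function of the input length n.
n = 5
condition = BudgetedTermination(max_inferences=10 * n ** 3, max_seconds=2.0)
```

When the budget is exhausted before the goal's value is final, the value computed so far is an approximation, as in the unary-cycle example above.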

2.5 Abstract Parsing Algorithm 

Goodman (1998) presented an abstract parsing algorithm whose parameters are a logic,
a grammar, and a semiring. His algorithm employs a search strategy that depends on a
priori computation of dependencies among all possible terms, so that the terms can be
topologically sorted into "buckets." The parser's inferences are then fired in the order
of their consequents' buckets. Goodman's algorithm also assumes that the termination
condition is based on a particular goal item. Table 2 presents a more detailed and more
general abstract parsing algorithm, where the search strategy and termination condition
are parameters, along with the logic, the grammar, the semiring, and the input text.

The parser initializes all possible terms with 0_R, the value of the additive identity
element of the semiring.⁵ It then re-initializes axiom terms. It consults the grammar for
the value G(p) of each grammar term p′. The grammar must be compatible with the
semiring so that G(p) is always a semiring value (i.e. an element of the semiring's set).
Ordinarily, all other axioms are assigned the semiring's multiplicative identity element
1_R. However, if the input is nondeterministic, then the input axioms can take other
values. E.g., they might be weighted by the acoustic module of a speech recognizer.

After initialization, the parser enters its main loop. On each iteration of the main 
loop, the parser first calls the search strategy to select a set of antecedent terms. The 
parser places no restrictions on how the search strategy might do so. However, a typical 
search strategy would keep track of which sets of antecedents it selected previously, 
to avoid duplication of effort. It would then return antecedent sets that either
(a) have never been selected before, or (b) have had one of their elements' values changed
since the previous time they were selected. If the search strategy cannot find a set of
antecedents with one of these properties, it would return the empty set, which might 
satisfy the termination condition. 

When the parser receives the set of antecedents from the search strategy, it passes 
them to the logic. The logic compares the antecedents to the signatures of its inference 
rules. For every matching inference rule, the logic instantiates every possible conse- 
quent. It passes all the consequents from all matching inference rules back to the parser. 



⁵ A typical implementation would not store terms that have this value, so this step would do nothing.

Table 2
Abstract Parsing Algorithm

Input:
  • logic L,
  • grammar G,
  • semiring R,
  • search strategy S,
  • termination condition C,
  • text T

for all possible terms x ∈ L do
    V(x) = 0_R                ▷ 0_R is the additive identity element of R.
for all axioms p′ ∈ L corresponding to a production rule p ∈ G do
    V(p′) = G(p)              ▷ G(p) is the value that G assigns to p.
for all axioms w′ ∈ L corresponding to word w ∈ T do
    V(w′) = T(w)              ▷ T(w) = 1_R if T is unambiguous.
for all other axioms a ∈ L do
    V(a) = 1_R                ▷ 1_R is the multiplicative identity element of R.
repeat
    get a set of antecedents X = {x_1, ..., x_k} from S
    for all inferences I ∈ L such that X unifies with the antecedents of I do
        for all possible terms y that unify with the conseq. of "I unified with X" do
            SetAntSet(y) = SetAntSet(y) ∪ {X}                                              (2)
            $V(y) = \bigoplus_{X \in SetAntSet(y)} \bigotimes_{i=1}^{|X|} V(x_i)$          (3)
until C satisfied

The parser then performs two updates for each consequent. Equation 2 updates
the record of the consequent's set of antecedent sets (SetAntSets). Equation 3 uses the
two operators of the semiring to recompute the consequent's value from the values of
its SetAntSets. The SetAntSets data structure is the same as the set of "back-pointers"
necessary to represent a packed derivation forest computed under the derivation-forest
semiring. For some other semirings, this structure is partially or totally redundant, and
more efficient updates are possible. For example, under the Viterbi-derivation semiring,
it is necessary to keep track of only the highest-probability antecedent set. However, we
don't know of any way to optimize Equation 2 that would be correct for all parsing
semirings.

Under some semirings, Equation 3 can also be optimized to reuse previous computations,
along the lines of Goodman (1998)'s "Item Value Formula." However, such
optimizations can invalidate nonmonotonic parsing logics, such as those that involve
pruning (see Section 5.4). Another possible optimization is to move Equation 3 outside
the main loop, and compute V(y) just once for each y. This optimization is impractical
in the common scenario where the termination condition involves a time limit. Of all
the abstract parsing algorithms that we are aware of, the algorithm in Table 2 is the only
one that admits all known parsing logics, semirings, search strategies, and termination
conditions.
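The following Python sketch shows one way the abstract algorithm could be realized for Logic D1C on a toy grammar. It is a minimal illustration under several assumptions, not the article's implementation: terms are encoded as tagged tuples, the names `logic_d1c`, `exhaustive_strategy` and `parse` are hypothetical, terms with the additive identity value are simply not stored, and the search strategy naively enumerates candidate antecedent tuples on every call.

```python
from functools import reduce
import operator

def logic_d1c(X):
    """Logic D1C: return the consequents licensed by antecedent tuple X."""
    tags = tuple(x[0] for x in X)
    if tags == ("word", "lex"):                        # Scan
        (_, i, t), (_, lhs, t2) = X
        return [("item", lhs, i - 1, i)] if t == t2 else []
    if tags == ("item", "item", "bin"):                # Compose
        (_, Y, i, j), (_, Z, j2, k), (_, lhs, Y2, Z2) = X
        return [("item", lhs, i, k)] if (j, Y, Z) == (j2, Y2, Z2) else []
    return []

def exhaustive_strategy(zero):
    """Return antecedent tuples that are new or whose values have changed."""
    last = {}  # antecedent tuple -> the values it had when last selected
    def select(V):
        live = [t for t, v in V.items() if v != zero]
        by_tag = lambda tag: [t for t in live if t[0] == tag]
        words, lexes, items, bins = map(by_tag, ("word", "lex", "item", "bin"))
        candidates = [(w, r) for w in words for r in lexes]
        candidates += [(a, b, r) for a in items for b in items for r in bins]
        for X in candidates:
            now = tuple(V[x] for x in X)
            if last.get(X) != now:
                last[X] = now
                return X
        return ()  # nothing left to select
    return select

def parse(logic, grammar, semiring, strategy, done, text):
    plus, times, zero, one = semiring
    V = dict(grammar)                                  # grammar axioms get G(p)
    for i, w in enumerate(text, start=1):              # input axioms get 1_R
        V[("word", i, w)] = one
    ant_sets = {}                                      # SetAntSet(y)
    while True:
        X = strategy(V)
        if done(X):
            return V
        for y in logic(X):
            ant_sets.setdefault(y, set()).add(X)       # Equation 2
            V[y] = reduce(plus, (reduce(times, (V[x] for x in A), one)
                                 for A in ant_sets[y]), zero)  # Equation 3

# Toy grammar with illustrative probabilities.
grammar = {("lex", "Det", "the"): 1.0, ("lex", "N", "witch"): 0.5,
           ("bin", "NP", "Det", "N"): 1.0}
inside = (operator.add, operator.mul, 0.0, 1.0)
V = parse(logic_d1c, grammar, inside, exhaustive_strategy(0.0),
          lambda X: X == (), ["the", "witch"])
goal_value = V.get(("item", "NP", 0, 2), 0.0)
```

The same `parse` function runs unchanged under the Boolean semiring if the grammar values are replaced by `True` and the semiring tuple by `(operator.or_, operator.and_, False, True)`.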

3. Generalized Parsers 

In an ordinary parser, the input is a single string, and the grammar ranges over strings. 
A convenient way for an SMT system to create and use tree-structured translation mod- 
els is via generalizations of ordinary parsing algorithms that allow the input to consist 
of string tuples and/or the grammar to range over string tuples. The kind of string tu- 
ples that are most relevant here are texts that are translations of each other, also called 
parallel texts or multitexts. Each multitext consists of component texts or components. 
Borrowing from vector algebra, we shall use dimension as a synonym for component,
so the number of components in a given multitext is its dimensionality. 

Figure 2 shows some of the ways in which ordinary parsing can be generalized. A
multiparser⁶ is an algorithm that can infer the tree structure of each component text in
a multitext and simultaneously infer the correspondence relation between these structures.⁷
When a parser's input can have fewer dimensions than the parser's grammar,
we call it a translator. When a parser's grammar can have fewer dimensions than the
parser's input, we call it a hierarchical aligner, or just an aligner when the context is
unambiguous.⁸ The corresponding processes are called multiparsing, translation and
hierarchical alignment, respectively.

Many previously published algorithms can also be viewed as generalized parsers 
<Aho and Ullman, 1969; Wu, 1996, Alshawi, 1996. Hwa et al., 2002 e.g.). Some of these 
other parsers are fundamentally similar to our parsers and to each other. Others are 
superficially similar but subtly different. For example, some of the algorithms that have 
been put forth as generalizations of the CKY algorithm turn out to be more complicated 

^An equivalent term is synchronous parser (Melamed, 2003). 

^A suitable set of monolingual parsers can also infer the tree structure of each component, but it cannot 
infer the correspondence relation between these structures. 

^This class of algorithms has many names in the literature, such as structural matching 
(Matsumoto et al., 1993), sub-sentential alignment (Groves, Hearne, and Way, 2004), and synchronization 
(Melamed, 2004b). There are also names for proper subclasses of algorithms, such as "tree alignment" 
(Meyers, Yangarber, and Grishman, 1996) and "biparsing alignment" (Wu, 2000). Although "alignment" 
traditionally referred to monotonic relations (i.e., without crossing correspondences), we follow what seems to 
have become standard nomenclature in computational linguistics. 
Figure 2 

Two perspectives on the space of generalized parsers. 



than our generalization (see Section]^), and therefore possibly more complicated than 
necessary. As we shall show, the similarities and differences are easier to see when the 
semiring, the search strategy, and the terminating conditions are abstracted away. 

Taking advantage of the clarity provided by these abstractions, we shall elucidate 
the relationships between several classes of generalized parsers: 

• The class of ordinary parsers is a proper subclass of the class of multiparsers, 
because the grammars and logics used for ordinary parsing are special cases of 
the grammars and logics used for multiparsing. 

• The class of multiparsers is a proper subclass of the class of translators, because 
the logics of multiparsing are a subset of the logics of translation. 

• The class of multiparsers is also a proper subclass of the class of hierarchical 
aligners, because the grammars used for multiparsing are a subset of the gram- 
mars used for hierarchical alignment. 

These relationships could not have been spelled out as precisely without the abstrac- 
tions in Section|5] 

Most of the rest of this article is a guided tour of the generalized parsers that are use- 
ful for SMT by Parsing. The next section describes the kind of grammar that generalized 
parsers use, and presents a particular grammar formalism that will serve as a vehicle 
for our tour. The three sections after the next give detailed examples of multiparsers, 
translators, and hierarchical aligners. Then, Sections 17.21 and 151 present two additional 
generalized parsers that are necessary for a complete system. All of the algorithms on 
the tour are special cases of the abstract parsing algorithm in Table 2. 

4. Grammars for Generalized Parsing 

To parse string tuples, we need grammars that can evaluate structures over string tu- 
ples, rather than just structures over strings. Grammars that can evaluate structures 
[Figure 3 (image not reproduced): the English sentence "Pat went home early", its 
transliterated Russian translation, and the production rule below.] 

S[went] => NP[Pat]^1 V[went]^2 Adv[home]^3 Adv[early]^4 
S[pashol] => Adv[damoy]^3 NP[Pat]^1 Adv[rano]^4 V[pashol]^2 

Figure 3 

Discontinuous constituents are practically unavoidable in generalized parsing, for example 
when inferring this production rule in English and (transliterated) Russian. In the illustration on 
the right, arrows are 2D bilexical dependencies, and shaded squares are 2D constituents. 



over string tuples are often called synchronous grammars.^ This article is not about 
grammar formalisms, but for expository purposes it is convenient to use one particu- 
lar formalism as a running example. Our choice of synchronous grammar formalism is 
informed by certain properties of popular monolingual grammar formalisms and pars- 
ing algorithms. To minimize the conceptual leap from ordinary parsing to generalized 
parsing, we shall employ a grammar formalism that is similar to CFG, in that it uses 
production rules and is associated with a context-free derivation process. 

Another consideration is that popular monolingual grammar formalisms, such as 
CFG, TAG, and CCG explicitly express subcategorization frames, and recognize that 
subcategorization frames often have more than two dependents. In formalisms that 
involve production rules, such subcategorization frames are expressed via production 
rules that have more than two nonterminals on the RHS. For inferring such productions, 
it is always more efficient to binarize the grammar (either explicitly or implicitly) than 
to allow a parser to compose more than two parts of the input at a time. However, 
in general, binarization of synchronous production rules can result in discontinuous 
nonterminals. 

Early synchronous grammars, such as syntax-directed transduction schemata (SDTSs) 
(Aho and Ullman, 1969) and their subclass of inversion transduction grammars (ITGs) 
(Wu, 1997), were defined for contiguous constituents only. Therefore, these formalisms 
cannot generate certain important multitext correspondence patterns using binary 
derivation trees. For example, an SDTS that allows up to four symbols on the RHS of a 
production rule can generate the correspondence pattern in Figure 3, but no SDTS with a lower 
limit can generate it. The distinguishing characteristic of such patterns is that no two 
constituents are adjacent in more than one dimension. Each set of sibling constituents in 
such a pattern must be encapsulated in the RHS of a single nonterminating production 
rule, like the one shown in Figure 3. 

If the grammar is bilexical, then even productions with only three nonterminals on 
the RHS can require discontinuous constituents for binarization. This case can arise, 
for example, when two prepositional phrases switch places in translation, as shown in 
Figure 4. Bilexical parsers typically compose dependents with the head-child, rather 
than with other dependents, because otherwise some items would need to keep track 
of multiple lexical heads, increasing computational complexity. Each of the dependents 
in Figure 4 is adjacent to the head-child in only one dimension. Regardless of which 
dependent is attached first, a discontinuous item will result. 

^They were originally called transduction grammars (Aho and Ullman, 1969), but we follow the majority 
of the literature in avoiding this term so as to de-emphasize the input-output connotation of "transduction" 
(cf. Wu (1997), p. 378). 
[Figure 4 (image not reproduced): the English phrase "a gift for you from France", 
bracketed as [a gift] [for you] [from France], aligned with the French 
[un cadeau] [de France] [pour vous].] 

Figure 4 

Discontinuous constituents are required for bilexical parsing, even for simple bitexts in English 
and French. 



Synchronous grammar formalisms that do not allow discontinuous constituents are 
unlikely to have adequate coverage, even for multitexts involving languages that are 
syntactically similar (Zens and Ney, 2003). Simard et al. (2005) have presented empirical 
evidence that an SMT system can perform better if it can manipulate discontinuous 
constituents. To our knowledge, the simplest synchronous grammar formalism that 
deals in discontinuities is generalized multitext grammar (GMTG) (Melamed, Satta, and Wellington, 2004). 
GMTGs are a complete generalization of CFGs to the synchronous case: GMTGs can 
express arbitrary tree structures over arbitrarily many parallel texts. This article uses 
GMTG as a running example of a tree-structured translation model because GMTG is 
the simplest formalism that can be used to illustrate the concepts that we consider 
important. More sophisticated formalisms would be necessary to represent a variety of 
translational divergence patterns (Dorr, 1994), but our abstract parsing algorithm can 
accommodate them without modification. In a similar spirit, Goodman (1998, Section 2-B.2) 
presented a parsing logic for tree-adjoining grammars that can be used in his abstract 
parsing algorithm without modification. 

Every GMTG is a D-GMTG for some integer constant D > 0, and it generates multitexts 
with D components. Thus, 1-GMTGs generate ordinary texts and 2-GMTGs generate 
bitexts. A GMTG has disjoint sets of terminals T and nonterminals N. We often 
group terminals and nonterminals into vectors that we call links. Links express the 
translational equivalence between their components. In GMTG applications, the different 
components of a link will often come from largely disjoint subsets of T or N, 
representing the vocabularies and linguistic categories of different languages. Every 
link generated by a D-GMTG has D components, some of which may be inactive.^ An 
inactive link component indicates that the active components vanish in translation to 
the inactive component. 

Each GMTG also has a set of production rules (or just productions for short). A 
production in 2-GMTG might look like this: 



(4) 



There is one row per production component, on both the left-hand side (LHS) and the 
right-hand side (RHS). Each symbol in parentheses on the LHS is a nonterminal. Each 
component of the RHS is a string of terminals and /or indexed nonterminals. The in- 
dexes are not part of the nonterminal labels; they exist only in production rules. The 

^Inactive components are distinct from components that contain the empty string 
(Melamed, Satta, and Wellington, 2004). This distinction obviates the need to keep track of the positions of 
empty strings during parsing. 
same nonterminal symbol may appear multiple times in the same component or in different 
components, either with the same index or with a different index (like A). 

The indexes express translational equivalence: All the nonterminals with the same 
index constitute a link. The derivation process rewrites linked nonterminals as atomic 
units. Some nonterminals on the RHS might have no translation in some components, 
in which case there will be no co-indexed nonterminal in those components (as for 
and C^). The derivation process rewrites such nonterminal links like any other link, 
generating parse subtrees that are inactive in some components. In the limit, a non- 
terminal symbol in one dimension might not be coindexed with any other nontermi- 
nal symbol in its production rule. Repeated rewriting of such degenerate nonterminal 
links can generate arbitrarily deep one-dimensional subtrees that correspond to other 
dimensions only at their root. A GMTG can generate tuples of such subtrees to repre- 
sent translational equivalence among "phrases," a concept that is currently popular in 
SMT (Koehn, Och, and Marcu, 2003, e.g.). Terminals never have indexes because they 
are never rewritten. 

The production rule notation described above, which is the original notation of 
Melamed, Satta, and Wellington (2004), uses superscripts to superimpose information 
about translational equivalence on top of information about the linear order of 
constituents. This notation highlights the relationship between GMTG and the familiar 
CFG. Unfortunately, this notation is not conducive to describing the way that grammar 
terms interact with the inference rules in parsing logics. In order to specify inference 
rule signatures completely and compactly, we introduce an alternative notation 
for GMTG productions that have nonterminals on the RHS. The new notation separates 
information about translational equivalence from information about the linear order of 
constituents, enabling independent reference to each type of information. 
Here is Production (4) rewritten in the new notation: 



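\[
\begin{bmatrix} (X) \\ (Y) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} [1,2,3] \\ [4,1] \end{bmatrix}
\begin{pmatrix} A & e & C & () \\ D & () & () & A \end{pmatrix}
\tag{5}
\]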



In this notation, nonterminal links are written in columns, and their linear order is 
indicated by a preceding vector of special data structures called precedence arrays, one 
array per component. E.g., the precedence array in the second component above is [4, 1]. 
The first index in this array is 4, referring to the fourth column in the link vector, and 
indicating that A comes first in that component. The special symbol () acts as a placeholder 
for inactive link components. The indexes in a precedence array never refer to 
links that are inactive in their component. If the LHS link is inactive in a given component, 
then all the links on the RHS must also be inactive, and vice versa. In that case, 
the component is called inactive and the precedence array must be empty. Precedence 
arrays are more informative than the role templates used by Melamed (2003), because 
role templates obscure link information. The $\bowtie$ ("join") operator rearranges the symbols 
in each component's link vector according to that component's precedence array to 
recover the original production rule notation. For example, 



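\[
\bowtie
\begin{bmatrix} [1,2,3] \\ [4,2,1,5] \\ [3,2,4] \end{bmatrix}
\begin{pmatrix} A & B & C & () & () \\ Y & X & () & W & Z \\ () & U & V & T & () \end{pmatrix}
=
\begin{bmatrix} A^{1} B^{2} C^{3} \\ W^{4} X^{2} Y^{1} Z^{5} \\ V^{3} U^{2} T^{4} \end{bmatrix}
\tag{6}
\]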



All the precedence arrays in a given production rule constitute a precedence array vec- 
tor (PAV). 
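
The join operation can be sketched in a few lines of Python. This is only an illustration under assumptions of ours: links are stored as tuples with one entry per component, `None` stands for the inactive placeholder (), and a precedence array is a list of contiguous runs of 1-based link indexes (a single run when the component has no gaps). The names `join` and `INACTIVE` are hypothetical, and the sample data is our reading of the example in Equation (6).

```python
from typing import List, Optional, Sequence, Tuple

INACTIVE = None  # stands for the inactive placeholder "()"

Link = Tuple[Optional[str], ...]    # one symbol (or INACTIVE) per component
PrecedenceArray = List[List[int]]   # contiguous runs of 1-based link indexes


def join(pav: Sequence[PrecedenceArray], links: Sequence[Link]) -> List[List[List[str]]]:
    """Rearrange each component's symbols according to its precedence array."""
    production_rhs = []
    for component, array in enumerate(pav):
        runs = []
        for run in array:
            symbols = []
            for index in run:
                symbol = links[index - 1][component]
                if symbol is INACTIVE:
                    # Precedence arrays never refer to links that are
                    # inactive in their component.
                    raise ValueError("precedence array refers to an inactive link")
                symbols.append(f"{symbol}^{index}")
            runs.append(symbols)
        production_rhs.append(runs)
    return production_rhs


# Link columns and PAV as we read them from Equation (6).
links = [("A", "Y", INACTIVE), ("B", "X", "U"), ("C", INACTIVE, "V"),
         (INACTIVE, "W", "T"), (INACTIVE, "Z", INACTIVE)]
pav = [[[1, 2, 3]], [[4, 2, 1, 5]], [[3, 2, 4]]]
rhs = join(pav, links)
```

Each row of `rhs` lists the co-indexed symbols of one component in their linear order, which is the information carried by the original superscript notation.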

[Figure 5 (image not reproduced).] 

Figure 5 

Example of a GMTG production involving discontinuities on both the LHS and the RHS. 
Numbers indicate co-indexation. 

Precedence arrays can express discontinuities. They can also indicate how to arrange 
parts of discontinuous subconstituents. For example, suppose that the first component 
of Production (5) had a gap between the e and the C. Suppose further that the D 
in the second component contained a gap, and that the A in that component filled that 
gap. Then the production might be written like this: 



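\[
\begin{bmatrix} (X) \\ (Y) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} [1,2;3] \\ [1,4,1] \end{bmatrix}
\begin{pmatrix} A & e & C & () \\ D & () & () & A \end{pmatrix}
\tag{7}
\]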



In the first component, the semicolon in the precedence array indicates the position of 
the gap. In the second component, the precedence array indicates that the two parts of 
D (the nonterminal in the first link) should wrap around the A (the nonterminal in the 
fourth link). The multitree fragment generated by this production rule is illustrated in 
Figure 5. Precedence arrays are more general than permutations, because precedence 
arrays can refer to the same position more than once, as for the D in the second component 
above. 

For each production in the original CFG-style notation, there are many ways to 
re-express it in the new notation. The existence of multiple ways to express the same 
constraint is called spurious ambiguity, and it leads to wasted effort during parsing. To 
avoid spurious ambiguity, we stipulate a normal form for production rules in the new 
notation. The normal form requires that, if the arrays in the PAV are concatenated, then 
the first appearance of an index i must precede the first appearance of an index j for all 
i < j, except where the arrangement is incompatible with an earlier choice of indexes. 
We could, for example, obtain the same result in Equation (6) if we put the link ((), Z, ())^T before the link ((), W, T)^T 
and switch their indexes in the 2nd and 3rd precedence arrays. However, the normal 
form requires the 2nd precedence array to be [4, 2, 1, 5], not [5, 2, 1, 4], so ((), Z, ())^T must be 
listed last in the link vector. There is a one-to-one correspondence between production 
rules in the new notation that are in normal form and production rules in the original 
notation. 
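
The main condition of this normal form is easy to test mechanically. The sketch below is ours, not part of the formalism: it represents a PAV as a list of precedence arrays, each a list of contiguous runs of link indexes, and it checks only that first appearances occur in increasing order. It deliberately ignores the exception clause about arrangements that are incompatible with an earlier choice of indexes. The function names are hypothetical.

```python
from typing import List

PrecedenceArray = List[List[int]]  # contiguous runs of 1-based link indexes


def first_appearances(pav: List[PrecedenceArray]) -> List[int]:
    """Link indexes in the order in which they first appear in the concatenated PAV."""
    seen: List[int] = []
    for array in pav:
        for run in array:
            for index in run:
                if index not in seen:
                    seen.append(index)
    return seen


def is_normal_form(pav: List[PrecedenceArray]) -> bool:
    """True if index i first appears before index j whenever i < j."""
    order = first_appearances(pav)
    return order == sorted(order)


# The PAV of Equation (6), and the variant with the last two links swapped.
normal = [[[1, 2, 3]], [[4, 2, 1, 5]], [[3, 2, 4]]]
swapped = [[[1, 2, 3]], [[5, 2, 1, 4]], [[3, 2, 5]]]
```

Under this check, `normal` is accepted and `swapped` is rejected, matching the discussion of Equation (6) above.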

For simplicity, we shall limit our attention to GMTGs in Generalized Chomsky 
Normal Form (GCNF) (Melamed, Satta, and Wellington, 2004). This normal form allows 
simpler algorithm descriptions than the normal forms used by Wu (1997) and 
Melamed (2003). In GCNF, every GMTG production is either a terminating production 
or a nonterminating production. Terminating productions have the form 

\[
\begin{bmatrix} ()_1^{d-1} \\ (X) \\ ()_{d+1}^{D} \end{bmatrix} \Rightarrow
\begin{bmatrix} ()_1^{d-1} \\ (t) \\ ()_{d+1}^{D} \end{bmatrix}
\tag{8}
\]



All components except for one are inactive. The active component has a single nonterminal 
symbol on the LHS and a single terminal symbol on the RHS. Nonterminating 
productions have the form 

\[
\begin{bmatrix} (X_1) \\ \vdots \\ (X_D) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} \pi_1 \\ \vdots \\ \pi_D \end{bmatrix}
\begin{pmatrix} Y_1 & Z_1 \\ \vdots & \vdots \\ Y_D & Z_D \end{pmatrix}
\]

where every X, Y, and Z is either a nonterminal symbol or (), and every $\pi$ is a precedence 
array. In GCNF, every nonterminating production must have exactly two nonterminal 
links on the RHS. These two links may or may not have any active dimensions in common. 
However, whenever $X_i$ is (), both $Y_i$ and $Z_i$ must also be (), and vice versa. Each 
link can have an arbitrary number of discontinuities, which means that the precedence 
arrays can be arbitrarily long. However, every index in those arrays is either 1 or 2. 

The fan-out of a constituent, a nonterminal symbol, or a precedence array is the 
number of its contiguous elements. The fan-out of a PAV is the sum of the fan-outs of its 
component precedence arrays. E.g., the fan-out of the PAV in Production (7) is 2 + 1 = 3. 
The fan-out of a GMTG is the maximum of the fan-outs of the PAVs in its production 
rules. A 1-GMTG with a fan-out of 1 is a CFG. 
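
These definitions translate directly into code. The following sketch assumes, as a representation of our own choosing, that a precedence array is a list of contiguous runs of link indexes, so that a semicolon in the paper's notation becomes a boundary between runs. The function names are hypothetical.

```python
from typing import List

PrecedenceArray = List[List[int]]  # contiguous runs of link indexes


def array_fan_out(array: PrecedenceArray) -> int:
    """Fan-out of one precedence array: its number of contiguous runs."""
    return len(array)


def pav_fan_out(pav: List[PrecedenceArray]) -> int:
    """Fan-out of a PAV: the sum of the fan-outs of its precedence arrays."""
    return sum(array_fan_out(array) for array in pav)


def grammar_fan_out(pavs: List[List[PrecedenceArray]]) -> int:
    """Fan-out of a grammar: the maximum fan-out over the PAVs of its productions."""
    return max(pav_fan_out(pav) for pav in pavs)


# The PAV of Production (7): [1,2;3] has two runs, [1,4,1] has one.
production_7 = [[[1, 2], [3]], [[1, 4, 1]]]
# A contiguous two-component PAV, such as [1,2] over [2,1].
contiguous = [[[1, 2]], [[2, 1]]]
```

Here `pav_fan_out(production_7)` evaluates to 3, as in the text, and an inactive component (an empty precedence array) contributes 0.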

The GMTG derivation process can be represented by a derivation tree, just like the 
derivation process of CFGs. As for CFG, GMTG derivation trees are identical to the 
resulting parse trees. Several graphical representations are common for such trees, as 
illustrated in Figure 6. For example, consider a GMTG with the production rules in 
Table 3(a). That GMTG can derive the structure in Figure 6 as shown in Table 3(b). The 
multidimensional perspective in Figure 6(c) led us to refer to such trees as multitrees. 

Due to the importance of lexical information in disambiguating linguistic structure, 
we shall have reason to discuss lexicalized GMTGs (LGMTGs) of the bilexical variety 
(L2GMTGs). In an L2GMTG, every nonterminal symbol has the form L[t] for some terminal 
t ∈ T and some label L ∈ Λ. Λ is a set of "delexicalized" nonterminal labels. 
Intuitively, Λ corresponds to the nonterminal set of an ordinary CFG. The terminal t 
is the lexical head of its constituent. One nonterminal in each component on the RHS 
of an L2GMTG production serves as the head-child of the nonterminal in the corresponding 
component on the LHS. The head-child inherits the lexical head of its parent 
nonterminal. 

5. Logics for Generalized Parsing 
5.1 Discontinuous Spans 

We now introduce some notation for describing discontinuities in parse items, and some 
machinery for operating on them. Expanding on Johnson (1985), we define a discontinuous 
span (or d-span, for short) as a list of zero or more intervals $(b_1, e_1; \ldots; b_m, e_m)$, 
where 




(9) 
[Figure 6 (image not reproduced). Panels: (a) ordinary tree view, (b) parallel view, 
(c) 2D view. The multitree pairs the English "Wash the dishes" with the transliterated 
Russian "Pasudu moy"; its node labels include S/S, NP/NP, V, N, D, WASH, MIT, DISH and PAS.] 

Figure 6 

A 2D multitree in English and transliterated Russian. The three representations are equivalent: 
(a) Every internal node is annotated with the linear order of its children, in every component 
where there are two children. (b,c) Polygons are constituents. 
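Table 3 

A 2-GMTG with the production rules in (a) can derive the multitree in Figure 6 as shown in (b). 
Production (13) is not used in the derivation. 

(a) Productions (10)-(19): 

\[
\begin{bmatrix} (S) \\ (S) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} [1,2] \\ [2,1] \end{bmatrix}
\begin{pmatrix} NP & V \\ NP & V \end{pmatrix}
\qquad
\begin{bmatrix} (NP) \\ (NP) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} [1] \\ [2,1] \end{bmatrix}
\begin{pmatrix} N & () \\ N & D \end{pmatrix}
\]
\[
\begin{bmatrix} (V) \\ (V) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} [1] \\ [2] \end{bmatrix}
\begin{pmatrix} MIT & () \\ () & WASH \end{pmatrix}
\qquad
\begin{bmatrix} (N) \\ (N) \end{bmatrix} \Rightarrow\ \bowtie
\begin{bmatrix} [1] \\ [2] \end{bmatrix}
\begin{pmatrix} PAS & () \\ () & DISH \end{pmatrix}
\]
\[
\begin{bmatrix} () \\ (WASH) \end{bmatrix} \Rightarrow \begin{bmatrix} () \\ (\text{Wash}) \end{bmatrix}
\qquad
\begin{bmatrix} () \\ (WASH) \end{bmatrix} \Rightarrow \begin{bmatrix} () \\ (\text{clean}) \end{bmatrix}
\qquad
\begin{bmatrix} () \\ (D) \end{bmatrix} \Rightarrow \begin{bmatrix} () \\ (\text{the}) \end{bmatrix}
\]
\[
\begin{bmatrix} () \\ (DISH) \end{bmatrix} \Rightarrow \begin{bmatrix} () \\ (\text{dishes}) \end{bmatrix}
\qquad
\begin{bmatrix} (PAS) \\ () \end{bmatrix} \Rightarrow \begin{bmatrix} (\text{Pasudu}) \\ () \end{bmatrix}
\qquad
\begin{bmatrix} (MIT) \\ () \end{bmatrix} \Rightarrow \begin{bmatrix} (\text{moy}) \\ () \end{bmatrix}
\]

(b) [Step-by-step derivation not recoverable from the extracted text. It starts from the 
start link (S; S) and ends with the multitext (Pasudu moy; Wash the dishes).] 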
• the $b_i$ are span beginning positions and the $e_i$ are span ending positions, so that 
$b_i \le e_i$; 

• $e_i \le b_{i+1}$, which means that the intervals do not overlap; 

• a d-span is proper if all the above inequalities are strict; i.e., each span has nonzero 
width and there is a gap between each pair of consecutive intervals; 

• an empty d-span is denoted by (). 
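
The conditions in this list can be checked with a short Python sketch. The representation is an assumption of ours: a d-span is a list of (begin, end) pairs, and the inequalities are read as non-strict for ordinary d-spans and strict for proper ones. The function names are hypothetical.

```python
from typing import List, Tuple

DSpan = List[Tuple[int, int]]  # list of (begin, end) intervals; [] is the empty d-span


def _boundaries(span: DSpan) -> List[int]:
    """Flatten the intervals into the sequence b_1, e_1, b_2, e_2, ..."""
    return [position for interval in span for position in interval]


def is_dspan(span: DSpan) -> bool:
    """True if b_i <= e_i for every interval and e_i <= b_(i+1) for consecutive ones."""
    flat = _boundaries(span)
    return all(left <= right for left, right in zip(flat, flat[1:]))


def is_proper(span: DSpan) -> bool:
    """True if every inequality is strict: nonzero widths and a gap between intervals."""
    flat = _boundaries(span)
    return all(left < right for left, right in zip(flat, flat[1:]))
```

For instance, `[(1, 3), (8, 9)]` is a proper d-span, `[(1, 3), (3, 5)]` is a d-span but not a proper one because its intervals touch, and `[(1, 3), (2, 5)]` is not a d-span at all because its intervals overlap.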

As in ordinary spans, d-span boundaries range over positions between and around the 
words in a text. Parse items have one d-span per dimension. We shall denote vectors 
of d-spans by ν, σ and τ. A d-span that pertains to only one particular dimension d 
is denoted with a subscript, as in $\nu_d$. When a label or a d-span variable has both a 
superscript and a subscript, it refers to a range of dimensions. E.g., $\sigma_i^j$ is a vector of 
d-spans, one for each dimension from i to j. 
We define two operators over d-spans. 

• + is the ordered concatenation operator. Given two d-spans, it outputs the 
union of their intervals. E.g., (1, 3; 8, 9) + (7, 8) = (1, 3; 7, 9). Ordered concatenation 
is commutative: σ + τ = τ + σ. 

• ≀ is the relativization operator^^. Given a sequence of d-spans, it computes the 
precedence array that describes the contiguity and relative positions of their 
intervals. E.g., (1, 3; 8, 9) ≀ (7, 8) = [1; 2, 1], because if these two d-spans were 
concatenated, then the result would consist of the 1st interval of the 1st d-span, 
followed by a gap, followed by the 1st interval of the 2nd d-span, followed by 
the 2nd interval of the 1st d-span. Relativization is not commutative. 

The inputs of + and ≀ must have no overlapping intervals, or else the output is undefined. 
Both operators apply componentwise to vectors of d-spans. 
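
Both operators are simple to implement for a single dimension. The sketch below rests on assumptions of ours: d-spans are lists of (begin, end) pairs with nonzero width, a precedence array is returned as a list of contiguous runs (so [1; 2, 1] becomes `[[1], [2, 1]]`), and an undefined result is signalled with `ValueError`. The function names are hypothetical.

```python
from typing import List, Tuple

DSpan = List[Tuple[int, int]]  # list of (begin, end) intervals


def concat(*spans: DSpan) -> DSpan:
    """Ordered concatenation (+): the union of all intervals, merging adjacent ones."""
    merged: DSpan = []
    for begin, end in sorted(iv for span in spans for iv in span):
        if merged and begin < merged[-1][1]:
            raise ValueError("overlapping intervals: output undefined")
        if merged and begin == merged[-1][1]:
            merged[-1] = (merged[-1][0], end)   # adjacent: extend the last interval
        else:
            merged.append((begin, end))         # gap: start a new interval
    return merged


def relativize(*spans: DSpan) -> List[List[int]]:
    """Relativization: the precedence array of the given d-spans, as contiguous runs."""
    tagged = sorted((iv, k) for k, span in enumerate(spans, start=1) for iv in span)
    runs: List[List[int]] = []
    previous_end = None
    for (begin, end), k in tagged:
        if previous_end is not None and begin < previous_end:
            raise ValueError("overlapping intervals: output undefined")
        if previous_end is not None and begin == previous_end:
            runs[-1].append(k)    # contiguous with the previous interval
        else:
            runs.append([k])      # separated by a gap (a semicolon in the notation)
        previous_end = end
    return runs


sigma, tau = [(1, 3), (8, 9)], [(7, 8)]
```

With these inputs, `concat(sigma, tau)` gives `[(1, 3), (7, 9)]` in either argument order, while `relativize(sigma, tau)` gives `[[1], [2, 1]]` and `relativize(tau, sigma)` gives `[[2], [1, 2]]`, which illustrates that only concatenation is commutative.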

5.2 Logic C 

Table 4 contains Logic C, which is a generalization of Logic D1C to arbitrary GMTGs in 
GCNF. Parser C is any parser based on Logic C. The input to Parser C is a tuple of D 
parallel texts, with lengths $n_1, \ldots, n_D$. 

The term types used by Logic C are direct generalizations of the term types used by 
Logic D1C. The grammar terms represent terminating and nonterminating production 
rules of a GMTG in GCNF, rather than a CFG in CNF. The terminal items of Logic C have 
the same variables as the terminal items of Logic D1C, plus an additional variable d to 
indicate the input component to which an item pertains. Logic C's nonterminal items 
consist of a D-dimensional label vector $X_1^D$ and a D-dimensional d-span vector $\sigma_1^D$. The 
items need d-spans, rather than ordinary spans, because Parser C needs to know all the 
boundaries of each item, not just the outermost boundaries. Since GMTGs can generate 
multitexts with components of unequal length, a d-span in one component of an 
item might cover more words than a d-span in another component. In particular, some 
(but not all) dimensions of a nonterminal item can be inactive, having an empty d-span 
and no label. Such lower-dimensional items are necessary for representing multitree 
branches that are inactive in some components. A typical goal item used with Logic C 
would be a constituent covering the input multitext and labeled with the grammar's 
start link. An example of such a constituent is the outermost rectangle in Figure 6(c). 

^^Melamed (2003) used the ⊗ symbol for this operator, but we rename it here to avoid confusion with 
this symbol's traditional use in describing semirings. 
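Table 4 

Logic C: D is the dimensionality of the grammar and d ranges over dimensions; $n_d$ is the length 
of the input in dimension d; $i_d$ ranges over word positions in dimension d, $1 \le i_d \le n_d$; $w_{d,i_d}$ are 
input words; X, Y and Z are nonterminal symbols; t is a terminal symbol; π is a PAV; σ and τ are 
d-spans. 

Term Types 

  terminal items:              $\langle d, i, t \rangle$ 
  nonterminal items:           $[X_1^D ; \sigma_1^D]$ 
  terminating productions:     $[()_1^{d-1}, X, ()_{d+1}^D] \Rightarrow [()_1^{d-1}, t, ()_{d+1}^D]$ for $1 \le d \le D$ 
  nonterminating productions:  $X_1^D \Rightarrow\ \bowtie [\pi_1^D](Y_1^D\ Z_1^D)$ 

Axioms 

  input words:                 $\langle d, i_d, w_{d,i_d} \rangle$ for $1 \le d \le D$, $1 \le i_d \le n_d$ 
  grammar terms:               as given by the grammar 

Inference Rule Types 

  Scan component d, $1 \le d \le D$: 
    antecedents:  $\langle d, i, t \rangle$ , $[()_1^{d-1}, X, ()_{d+1}^D] \Rightarrow [()_1^{d-1}, t, ()_{d+1}^D]$ 
    consequent:   $[()_1^{d-1}, X, ()_{d+1}^D ;\ ()_1^{d-1}, (i-1, i), ()_{d+1}^D]$ 

  Compose: 
    antecedents:  $[Y_1^D ; \sigma_1^D]$ , $[Z_1^D ; \tau_1^D]$ , $X_1^D \Rightarrow\ \bowtie [\sigma_1^D \wr \tau_1^D](Y_1^D\ Z_1^D)$ 
    consequent:   $[X_1^D ;\ \sigma_1^D + \tau_1^D]$ 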

Parser C begins by firing Scan inferences, just like Parser D1C, but it can Scan from 
each of the D input components. A Scan inference can fire for the ith word $w_{d,i}$ in 
component d if that word appears in the dth component of the RHS of a terminating 
production in the grammar. Scan consequents have empty spans and no labels except 
in the active component. 

The parser can also Compose pairs of items into larger items. The antecedents of a 
Compose inference can have the same number of active components or a different num- 
ber. If both antecedents have inactive components, then their active components may 
or may not be the same. For example, to derive the parse tree in Figure 6, Logic C must 
make two inferences involving antecedents that have no active components in common. 
These are the inferences that compose two preterminals each^^. If the active components 
of one antecedent are a subset of the active components of the other, then the inference 

^^The preterminal nodes of a parse tree inferred under a GMTG in GCNF are always active in only one 
component. 
asserts that some of the yield of the higher-dimensional antecedent vanishes in translation. 
An example of such an inference is the composition that would infer the NP/NP 
node in Figure 6. 

Logic C's conditions for item composition are the ID and LP constraints described 
in Section IZTl generalized to possibly discontinuous items of arbitrary dimensionality. 
Both constraints now apply componentwise to every component of the antecedents. 
The LP constraint is now expressed using the d-span relativization operator defined 
in Section 5.1. Parser C can compose two items if the contiguity and relative order of 
their span intervals is consistent with the PAV of the antecedent production rule. Under 
our new notation for production rules, the LP constraint is completely independent 
of the nonterminal labels. Such independence of constraints is desirable for modular 
implementation, as well as for concise logic specification. A complete specification of 
Logic C using the original notation for production rules would require 0(4^) different 
Compose inference rule signatures. 
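
To make the Compose conditions concrete, here is a minimal Python sketch of a single Compose attempt. It is an illustration under several assumptions of ours rather than the parser's actual implementation: items are (labels, d-spans) pairs with `None` marking inactive labels, precedence arrays are tuples of contiguous runs, and the grammar is a hypothetical dictionary keyed by (PAV, Y labels, Z labels). The sample production is our reading of the NP production in Table 3, with the Russian component first.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]
DSpan = Tuple[Interval, ...]
Labels = Tuple[Optional[str], ...]        # None marks an inactive component
Item = Tuple[Labels, Tuple[DSpan, ...]]


def relativize(s: DSpan, t: DSpan) -> tuple:
    """Precedence array of two d-spans, as a tuple of contiguous runs."""
    tagged = sorted([(iv, 1) for iv in s] + [(iv, 2) for iv in t])
    runs: list = []
    previous_end = None
    for (begin, end), which in tagged:
        if previous_end is not None and begin < previous_end:
            raise ValueError("overlapping intervals")
        if previous_end is not None and begin == previous_end:
            runs[-1] = runs[-1] + (which,)
        else:
            runs.append((which,))
        previous_end = end
    return tuple(runs)


def concat(s: DSpan, t: DSpan) -> DSpan:
    """Ordered concatenation of two non-overlapping d-spans."""
    merged: List[Interval] = []
    for begin, end in sorted(s + t):
        if merged and begin == merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((begin, end))
    return tuple(merged)


def compose(item_y: Item, item_z: Item, grammar: Dict[tuple, List[Labels]]) -> List[Item]:
    """Return the consequents licensed by the grammar for this pair of antecedents."""
    (labels_y, spans_y), (labels_z, spans_z) = item_y, item_z
    try:
        # LP constraint: the PAV is determined componentwise by the d-spans alone.
        pav = tuple(relativize(s, t) for s, t in zip(spans_y, spans_z))
    except ValueError:
        return []
    spans_x = tuple(concat(s, t) for s, t in zip(spans_y, spans_z))
    # ID constraint: the grammar must contain X => join[pav](Y Z).
    return [(labels_x, spans_x) for labels_x in grammar.get((pav, labels_y, labels_z), [])]


# Hypothetical grammar holding one production: (NP; NP) => join[[1]; [2,1]]((N; N) ((); D)).
grammar = {((((1,),), ((2, 1),)), ("N", "N"), (None, "D")): [("NP", "NP")]}
n_item = (("N", "N"), (((0, 1),), ((2, 3),)))   # "Pasudu" / "dishes"
d_item = ((None, "D"), ((), ((1, 2),)))          # inactive / "the"
result = compose(n_item, d_item, grammar)
```

Note that the label lookup and the span computation never interact, which reflects the independence of the LP constraint from the nonterminal labels.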

Logic C is simpler and more general than the parsing logics used by Wu (1997) and 
Melamed (2003). Both the Link inference rule in Melamed (2003)'s Parser R2D2A and 
Equation 1 in (Wu, 1997) compose terminal items, but neither logic permits monolingual 
nonterminal items to compose with each other. In contrast, Logic C never composes 
terminals, so it involves only two types of inference rules. However, its Compose inference 
rule is more general because it admits composition of two lower-dimensional items that 
are active in the same dimension, composition of two items that are active in different di- 
mensions, and compositions of two items that are active in a different number of dimen- 
sions, in addition to the usual compositions of items that are active in all dimensions. 
Simplicity of description does not preclude computational complexity. However, con- 
ceptual complexity correlates with difficulty of engineering. To our knowledge, there 
have been no studies of the relative benefits of the two kinds of bottom-up logic. In 
the absence of evidence in support of more complicated logic, Occam's razor supports 
Logic C. 

5.3 Worst-Case Computational Complexity 

The abstract parsing algorithm in Table 2 has several sources of computational 
complexity. If the simplest possible search strategy is used (such as CKY), then the dominant 
source of complexity is the logic. We shall analyze the space and time complexity 
of any parser based on Logic C, using an extension of the static analysis method of 
McAllester (2002). 

The worst-case space complexity of a parser is within a constant factor of the maxi- 
mum number of possible distinct term instances that it needs to keep track of. A term's 
signature uniquely determines how the term can combine with other terms, so two 
terms that have the same values for the variables in the signature will never differ on 
whether they can participate in an inference rule. Therefore, we never need to store 
more than one term with the same variable values. The number of unique combina- 
tions of variable values is the product of the sizes of the variables' ranges. 

For a given GMTG G, let f be the fan-out of G, and let |N| be the number of nonterminal 
symbols in G. Let n be the length of the longest component of the input multitext. 
We assume that n is always smaller than the size of G's terminal set. Then the number of 
possible distinct terminal items in Parser C will be negligible compared with the number 
of possible distinct nonterminal items. The free variables in a nonterminal item's signature 
are its nonterminal symbol and span boundary in each dimension. The maximum 
number of required boundaries is exactly 2f, and each of the boundaries can range over 
O(n) possible positions. Thus, the space complexity of Parser C for a given D-GMTG 
G is in $O(|N|^{D} n^{2f})$. If G is bilexical, then the number of possible nonterminals hides a 
factor of $n^{D}$, raising the space complexity of Parser C to $O(|\Lambda|^{D} n^{D+2f})$. 

If the search strategy imposes an ordering of inferences that guarantees correctness 
and avoids duplication of effort,^^ then the worst-case running time of the abstract 
parsing algorithm is a product of three factors: the number of possible unique inference 
rule instantiations, the computational effort required for each instantiation,^^ and 
an implementation-specific constant. The number of possible unique inference rule 
instantiations is the product of the sizes of the ranges of the free variables that appear in 
the inference rules. For Parser C, these variables are the nonterminals and the d-spans. 
The PAVs are not free variables because they are uniquely determined by the d-spans. 
Assuming a fixed maximum fan-out f for the given grammar, the number of different 
spans in each inference depends on how many boundaries are shared between the 
antecedent items. In the best case, all the boundaries are shared except the two outermost 
boundaries in each dimension, and the consequent is contiguous. In the worst case, no 
boundaries are shared, and the inferred item stores all the spans of the antecedent items. 
In any case, if y and z are the fan-outs of the composed items, and x is the fan-out of the 
inferred item, then the number of free boundaries in a Compose inference is x + y + z. 
Thus, in the worst case, the number of free boundaries involved in a Compose inference 
is 3f. Each of these boundaries can range over O(n) possible values. Thus, there 
are $O(n^{3f})$ possible different d-span values. There are three nonterminals per dimension, 
which can have $O(|N|^{3D})$ possible different values. Finally, each inference rule 
instantiation requires the computation of the PAV in the antecedent grammar term and 
the computation of the d-span in the consequent, each at a cost in O(f). The total time 
complexity of Parser C is in $O(f |N|^{3D} n^{3f})$. For a binarized L2GMTG, which also needs 
to keep track of two lexical heads per dimension per inference, this complexity rises to 
$O(f |\Lambda|^{3D} n^{2D+3f})$. 
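
As a worked instance, and assuming that the time bounds are read as $O(f|N|^{3D}n^{3f})$ and $O(f|\Lambda|^{3D}n^{2D+3f})$, the smallest cases can be evaluated by substitution. The choice of D and f below is ours and serves only as an illustration.

```latex
% Ordinary parsing with a CFG in CNF: D = 1, f = 1
O\!\left(f\,|N|^{3D}\,n^{3f}\right) = O\!\left(|N|^{3}\,n^{3}\right)

% Bitext parsing without discontinuities: D = 2, one contiguous span per component, so f = 2
O\!\left(f\,|N|^{3D}\,n^{3f}\right) = O\!\left(2\,|N|^{6}\,n^{6}\right) = O\!\left(|N|^{6}\,n^{6}\right)

% The same bitext case with a bilexical grammar
O\!\left(f\,|\Lambda|^{3D}\,n^{2D+3f}\right) = O\!\left(|\Lambda|^{6}\,n^{10}\right)
```

The first line recovers the familiar cubic bound of CKY parsing, which is a useful sanity check on the general formula.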

We presented Logic C for its descriptive simplicity (only two inference rule types) 
and familiarity (from the CKY algorithm), not for its efficiency. Many other parsing 
logics are possible, and some of them offer lower worst-case time complexity with no 
loss of generality (Eisner and Satta, 1999; Melamed, 2003). Nevertheless, the worst-case 
computational complexity of generalized parsing will always be at least as high as that 
of ordinary parsing. 

5.4 Efficiency Despite Complexity 

For most practical applications, monolingual parsing in 0{rr'\N\^) is irtfeasible. If gen- 
eralized parsing is even more expensive, some would argue, then it will never be more 
than a theoretical curiosity. Yet, monolingual parsers are used daily in academia and 
in industry, because the average run times of well-engineered parsers are typically 
just a tiny fraction of the theoretical worst case. The same is true for WFST-based 
SMT, which involves inference algorithms with exponential computational complexity 
I Knight, 1999 (, and which is nevertheless the dominant approach in the field. Evidence 
is beginning to emerge that, as for these other classes of theoretically expensive algo- 
rithms, worst-case computational complexity should not prevent anyone from using 
generalized parsers (Chiang, 2005; Ding and Palmer, 2005).

One of the advantages of machine translation by generalized parsing is that its
practitioners need only generalize the efficiency mechanisms that have already been
developed for ordinary parsers. The two main techniques used to speed up parsers are
pruning (also known as "thresholding") and outside cost estimation (e.g., Goodman, 1998;
Caraballo and Charniak, 1998; Klein and Manning, 2003). The parsing logics of
Goodman (1998, Chapter 5) use outside cost estimates for making decisions about pruning.
However, it is also possible to prune without estimating outside costs. Let us consider
how these techniques can speed up generalized parsing.

In general, agenda-based search strategies offer no such guarantee.
The analysis of Melamed (2003) omitted this factor.

Goodman (1998) augmented his parsing logics with pruning by adding side-conditions
to the inference rules. Side-conditions on inference rules are always boolean tests, even 
if the semiring involved is not boolean. Side-conditions used for pruning test whether 
the semiring values of certain terms in the inference rule are larger than certain other 
values, such as the values of certain axiom terms (constants), the values of other terms 
in the same inference rule, and/or other values recorded in the parse state. Candidate
inferences are discarded without firing if their side-conditions are evaluated to be false. 

Side-conditions involving properties of the parse state that are neither constants nor
local to the inference rule usually render the logic nonmonotonic, perhaps even able
to remove elements from SetAntSets. If pruning functionality is added at a sufficiently
high level of abstraction, then nonmonotonicity need not significantly increase the
difficulty of correct implementation. The side-condition test can be added between lines 12
and 13 of the abstract parsing algorithm in Table 2. If the side-condition is false, the
algorithm removes the antecedent set from the consequent's SetAntSet instead of adding
it there, and proceeds to line 14. Since the abstract parsing algorithm is independent of
the dimensionality of the input or the grammar, it can apply side-conditions from logics 
for generalized parsing in exactly the same way as from logics for ordinary parsing. 
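
As a rough illustration of a pruning side-condition, the following Python sketch applies a beam threshold under a Viterbi-style (max, times) semiring. The chart layout, the item representation (label, span), and the names beam_side_condition and fire_inference are assumptions made for this example; they are not part of the abstract parsing algorithm. Nothing in the sketch depends on the dimensionality of the span, which is the point made above.

```python
def beam_side_condition(candidate_value, best_value, beam=1e-3):
    """Boolean test: is the candidate within a factor `beam` of the best value?"""
    return candidate_value >= best_value * beam


def fire_inference(chart, item, value, beam=1e-3):
    """Record `item` in `chart` unless the side-condition prunes it.

    `chart` maps hypothetical items (label, span) to their best value so far.
    Returns True if the inference fired, False if it was discarded.
    """
    _, span = item
    # Best value recorded so far for any item over the same span; this makes
    # the test depend on the parse state, so the logic is nonmonotonic.
    best = max((v for (_, s), v in chart.items() if s == span), default=0.0)
    if not beam_side_condition(value, best, beam):
        return False
    chart[item] = max(chart.get(item, 0.0), value)
    return True


chart = {}
print(fire_inference(chart, ("NP", (0, 2)), 0.5))     # True
print(fire_inference(chart, ("VP", (0, 2)), 0.0001))  # False
print(fire_inference(chart, ("VP", (0, 2)), 0.01))    # True
```

The second candidate is discarded because 0.0001 falls below 0.5 times the beam width; the third survives the same test.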

The outside cost estimate of a term is an estimate of the difference between the cost 
of that term and the cost of a possible descendant goal term. A* estimates are a well- 
known special subclass of outside cost estimates used in parsers. Outside cost estimates 
can be used to guide the search strategy towards terms that are more likely than others 
to be on the path to the goal. Since search strategies are independent of the dimension- 
ality of the input or the grammar, the necessary modifications to the search strategy 
in a generalized parser are the same as they are for the search strategy in an ordinary
parser, so we refer the reader elsewhere for details (e.g., Klein and Manning, 2003). The
necessary modifications to the parsing logic can vary, depending on what additional 
information the search strategy needs about the state of the parse. For example, to 
compute outside costs for his monolingual bottom-up parsing logics, Goodman (1998)
augmented them with new types of "summary" terms, which keep track of outside 
costs for equivalence classes of ordinary terms. These new term types are then used in 
side-conditions to make pruning decisions. 
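
One way to picture how outside estimates guide a search strategy is a best-first agenda keyed on a figure of merit. The sketch below is only an illustration under stated assumptions: it works with probabilities rather than costs, so the merit is the product of an item's inside value and an estimate of its outside value, and the function pop_order and the estimate table are hypothetical.

```python
import heapq


def pop_order(candidates, outside_estimate):
    """Return items in the order a best-first agenda would pop them.

    `candidates` is a list of (item, inside_value) pairs; `outside_estimate`
    maps an item to an estimate of its outside value (both hypothetical).
    """
    agenda = []
    for tie_breaker, (item, inside) in enumerate(candidates):
        merit = inside * outside_estimate(item)
        # heapq is a min-heap, so negate the merit to pop the best item first.
        heapq.heappush(agenda, (-merit, tie_breaker, item))
    order = []
    while agenda:
        _, _, item = heapq.heappop(agenda)
        order.append(item)
    return order


estimates = {"A": 0.1, "B": 1.0, "C": 0.5}
print(pop_order([("A", 0.9), ("B", 0.5), ("C", 0.6)], estimates.get))
# ['B', 'C', 'A']
```

Item A has the highest inside value but is popped last, because its estimated outside value makes it the least promising step towards the goal.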



6. Translation 



A D-GMTG can guide a multiparser to infer the hidden structure of a D-component
multitext. Now suppose that we have a D-GMTG and an input multitext with only I
components, where I < D. When some of the component texts are missing, we can ask
the parser to infer a D-dimensional multitree that includes the missing components,
which are supplied by the grammar. The resulting multitree will cover the I input
components/dimensions among its D dimensions. It will also express the D - I output
components/dimensions, along with their tree structures. When a parser's input can
have fewer dimensions than the parser's grammar, we call it a translator.

6.1 Translator CT 

Table 5 shows Logic CT, which is a generalization of Logic C. The items of Logic CT
have a D-dimensional label vector, as usual. However, their d-span vectors are only
I-dimensional. Recall that the purpose of d-spans is to help the parser to enforce LP
constraints, so that the input is covered only once. It would be pointless to constrain
the absolute positions of items on the output dimensions, because on those dimensions
there is no input to cover. On the output dimensions, we need only constrain the relative
order of items. Constraints on the relative order are specified by $\pi_{I+1}^{D}$ in the Compose
grammar term, which is the part of the PAV that pertains to the output dimensions.

Table 5

Logic CT: D is the dimensionality of the grammar, I is the dimensionality of the input, and d
ranges over dimensions; $n_d$ is the length of the input in dimension d; $i_d$ ranges over word
positions in dimension d, $1 \le i_d \le n_d$; $w_{d,i_d}$ are input words; X, Y and Z are nonterminal
symbols; t is a terminal symbol; $\pi$ is a PAV; $\sigma$ and $\tau$ are d-spans.

Translator CT is any generalized parser based on Logic CT. Translator CT scans only 
the input components. Terminating productions with active output components are 
Loaded: Their LHSs are added to the chart without d-span information. Composition 
proceeds as before, except that there are no constraints on the precedence arrays in the 
output dimensions: the precedence arrays in $\pi_{I+1}^{D}$ are free variables.

As in Parser C, the first few Compose inferences fired by Translator CT typically link 
items that have no active dimensions in common. If one of the items exists only in the 
input dimension(s), and the other only in the output dimension(s), then this inference 
is, de facto, translation. As for all inference rules, the possible translations are deter- 
mined by consulting the grammar. Thus, in addition to its usual function of evaluating 
linguistic structures, the grammar simultaneously functions as a translation model. 

In summary, Logic CT differs from Logic C as follows:

• Items store no absolute position information (d-spans) for the output compo- 
nents. 

• For the output components, the Scan inferences are replaced by Load inferences, 
which are just like Scans except that they are not constrained by input. 

• The Compose inference does not constrain the absolute positions of items on the 
output components, although the antecedent PAV still constrains their relative 
positions. 

We have constructed a translator from a multiparser merely by relaxing some
constraints on the output dimensions. Table 5 is so similar to Table 4 because Parser C
is just Translator CT for the special case where I = D. The relationship between the two
classes of algorithms is easier to see from their declarative logics than it would be from 
their procedural pseudocode or equations. 

The relationship between translation and ordinary parsing was noted a long time 
ago (Aho and Ullman, 1969), but here we articulate it in more detail: Ordinary parsers
are a proper subclass of multiparsers, which are a proper subclass of translators. That 
Logic C is a special case of Logic CT explains why we view multiparsers as a subclass 
of translators. It may be counterintuitive to think of algorithms that produce no new 
words as translators, but any analysis or optimization that is valid for translators is 
also valid for multiparsers. The subclass relationship is convenient for both theoretical 
investigation and practical implementation. 

Logic CT can be used with any of the semirings listed in Section 2.3. For example,
under a boolean semiring, this logic will succeed on an I-dimensional input if and only
if it can infer a D-dimensional multitree, whose root is the goal item. Such a tree would
contain a (D - I)-dimensional translation of the input. Thus, under a boolean semiring,
Translator CT can determine whether a translation of the input exists, according to the
grammar.

With a probabilistic GMTG (PGMTG) and the inside semiring, Translator CT can
compute the total probability of all D-dimensional multitrees containing the I-dimensional
input. All these derivation trees, along with their probabilities, can be efficiently
represented as a packed parse forest, rooted at the goal item. Unfortunately, finding the
most probable output string still requires summing probabilities over an exponential
number of trees. This problem was shown to be NP-hard in the one-dimensional case
(Sima'an, 1996). There is no reason to believe that it is any easier in multiple dimensions.
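
To make the role of the semiring concrete, here is a minimal Python sketch of how the value of a goal item changes when only the semiring is swapped. It is an illustration, not the parser itself: the Semiring tuple, the function goal_value, and the toy derivations are assumptions, and a real parser would combine values over a packed forest instead of enumerating derivations.

```python
from collections import namedtuple
from functools import reduce

Semiring = namedtuple("Semiring", ["plus", "times", "zero", "one"])

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
INSIDE = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)


def goal_value(semiring, derivations):
    """Combine rule values within a derivation with times, across derivations with plus."""
    total = semiring.zero
    for rule_values in derivations:
        product = reduce(semiring.times, rule_values, semiring.one)
        total = semiring.plus(total, product)
    return total


# Two hypothetical derivations of the same goal item.
derivations = [[0.5, 0.5], [0.125]]
print(goal_value(INSIDE, derivations))              # 0.375
print(goal_value(VITERBI, derivations))             # 0.25
print(goal_value(BOOLEAN, [[True, True], [True]]))  # True
```

The inside semiring sums over both derivations, the Viterbi semiring keeps only the better one, and the boolean semiring merely reports that at least one derivation exists.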

6.2 Practical Variations 

6.2.1 Other Semirings and Search Strategies The Viterbi-derivation semiring would be
used most frequently in practice. Given a D-PGMTG, Translator CT can use this semiring
to find the single most probable D-dimensional multitree that covers the I-dimensional
input. For example, suppose that

• all the productions in Table 3(a) have probability 1.0, except that Production IHl
has probability 0.7 and Production lITSl has probability 0.3;

• we employ a uniform cost search strategy, so that the translator makes inferences
in order of decreasing probability of the consequent (ties are broken according
to the lexicographic order of the consequent labels);

• the input is Pasudu moy. 

The inferences that Translator CT would make under these conditions are shown in the 
proof tree in Figure 7. Each internal node represents an item. The children of each
item are its antecedents. The nonterminal items are numbered to indicate the order of 
inference. For example, the consequents numbered 3 and 6 precede the one numbered 7, 
because the former all have probability 1.0, but the probability of the latter is lowered 
by the probability of its antecedent production rule. The 2nd item is inferred before the 
3rd because the label "PAS/T of the 2nd consequent precedes the label "0/L»" of the 
3rd consequent in the lexicographic order. Note that the information in the proof tree in
Figure 7 is a superset of the information in Figure 6.
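
The ordering described above (decreasing probability, with ties broken lexicographically by label) can be sketched with a priority queue. The labels and probabilities below are hypothetical stand-ins, not the items of Figure 7, and the function inference_order only sorts a fixed list; a real agenda would also receive new consequents as inferences fire.

```python
import heapq


def inference_order(consequents):
    """Order (label, probability) pairs by decreasing probability, then by label."""
    # Negating the probability turns Python's min-heap into a max-heap on
    # probability, while equal probabilities fall back to comparing labels.
    agenda = [(-prob, label) for label, prob in consequents]
    heapq.heapify(agenda)
    order = []
    while agenda:
        neg_prob, label = heapq.heappop(agenda)
        order.append((label, -neg_prob))
    return order


print(inference_order([("V", 1.0), ("S", 0.7), ("N", 1.0)]))
# [('N', 1.0), ('V', 1.0), ('S', 0.7)]
```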

One of the productions in Table 3 is absent from Figure 7. By replacing the usual
CKY search strategy with a more sophisticated one, the translator avoids the expense of 
an inference involving Production M5\ . The benefits of alternative search strategies are 
easier to see when the grammar, the logic, the semiring, and the termination condition 
are abstracted away and held constant. 

6.2.2 Other Logics A naive implementation of Logic CT would be rather inefficient in 
practice. It requires Loading an axiom for each word in the target vocabulary, regardless 
of whether a Loaded word is a possible translation of some input word. With a large 
grammar, most Load consequents would never be Composed with input items, so those 
Load inferences would be a waste of time. A straightforward optimization is to check 
whether a target word might be the translation of some input word before Loading it. 
To implement this optimization, replace the Load inference rule with the following
inference rule, for $I < d \le D$:

This inference rule is essentially a macro of two rules from Logic CT: a Load inference,
and a Compose inference that could follow the Load when a suitable antecedent item
$[Y_1^I; \tau_1^I]$ has been inferred from the input. The macro will fire once for every (input
item, target word) pair, where the target word is a possible translation of the input item,
according to the grammar. This macro admits a greater variety of inferences than the
Link inference rule of Melamed (2003), because the antecedent input item $[Y_1^I; \tau_1^I]$ need
not be a Scan consequent. It can represent an arbitrarily deep multitree over the input
components. On the other hand, this macro does not allow Load consequents to Compose
with each other before composing with items covering some part of the input. More
sophisticated translation logics are necessary to achieve complete flexibility efficiently.

Figure 7

Proof tree for Translator CT's inference of the multitree in Figure 6 using the GMTG in Table 3(a)
with a Viterbi-derivation semiring, on input Pasudu moy. The child nodes of each item contain its
antecedents. The nonterminal items are numbered to indicate the order of their inference.

6.2.3 Other Grammars An SMT system can benefit from mixing the predictions of its
translation model with those of a more reliable monolingual language model (Brown et al., 1993).
The classic way to mix a translation model with a language model is the so-called
noisy-channel framework. This framework applies to conditional models, but Melamed (2004a)
shows that monolingual language models can also be mixed in a principled way with
joint models such as probabilistic synchronous grammars. The key benefit of such a
mixture is that it can help to evaluate every inference fired by a parsing logic. In this
manner, the language model can greatly accelerate the translation process, in comparison
to algorithms that apply a language model only after a complete multiforest has
been inferred. For example, the decoder of Yamada and Knight (2002) first builds a forest
of multitrees, and then searches for the single most probable output in the forest
using a language model. If only a single translation is desired, then there is no need to
compute a parse forest. Moreover, if only the single most probable translation is desired,
then various pruning methods can be used to speed up the search. A PGMTG mixed
with a target language model can provide sharper term probability estimates, making
the pruning methods more efficient.

6.3 Discussion 

The multitree inferred by the translator will have the words of both the input and the
output components in its leaves. In practice, we usually want the output as a string
tuple, rather than as a multitree. Under the various derivation semirings (Goodman, 1998),
Translator CT can store the output precedence arrays $\pi_{I+1}^{D}$ in each internal node of the
tree. The intended ordering of the terminals in each output dimension can be assembled
from these arrays by a linear-time linearization post-process that traverses the finished
multitree in postorder.

To the best of our knowledge, Translator CT is the first to be compatible with all
of the semirings listed in Section 2.3, among others. It is also unique in being able
to accommodate multiple input components and multiple output components
simultaneously. When a source document is available in multiple languages, a translator
can benefit from the disambiguating information in each. Translator CT can take
advantage of such information without making the strong independence assumptions of
Och and Ney (2001). When Translator CT is used to translate into multiple languages
simultaneously, each translation is constrained not only by the input, but also by all the
other translations. This approach might effect greater consistency across output
components, which is one of the putative benefits of the interlingual approach to MT. Indeed,
the language^15 of multitrees can be viewed as an interlingua.

7. Hierarchical Alignment 

In Section 6 we explored inference of I-dimensional multitrees under a D-dimensional
grammar, where D > I. Now we generalize along the other axis of Figure 2(a). It is
often useful to infer I-dimensional multitrees without the benefit of an I-dimensional
grammar. One application is inducing a parser in one language from a parser in another
(Lu et al., 2002). The application that is most relevant to this article is bootstrapping an
I-dimensional grammar.

15 Here we intend the formal language theory sense of "language."

In theory, it is possible to estimate a PGMTG from multitext in an unsupervised
manner, starting with a random or uniform distribution over production rules. However,
the quality of the parameter estimates is greatly affected by how they are initialized,
so such a simple approach is unlikely to produce good results. A more reliable
way to estimate PGMTG production probabilities is from a corpus of multitrees,
i.e., a multitreebank.^16 Despite some recent efforts to manually construct multitreebanks
(Uchimoto et al., 2004), it is unlikely that they will become available for more than a
handful of language pairs any time soon. The most straightforward way to create a
multitreebank is to parse some multitext using a multiparser, such as Parser C. However,
if the goal is to bootstrap an I-PGMTG, then there is no I-PGMTG that can evaluate
the grammar terms in the parser's logic.

Our solution is to orchestrate lower-dimensional knowledge sources to evaluate the 
grammar terms. Then, we can use our favorite multiparsing logic to align multitext into 
a multitreebank. If we have no PGMTG, then we can use other criteria to evaluate infer- 
ences. These other criteria can be based on various subsets of the information available 
in inference rules. 

For example, given a tokenized set of tuples of parallel sentences, it is always
possible to estimate a word-to-word translation model $\Pr(u_{D+1}^{I} \mid u_{1}^{D})$ (Brown et al., 1993).^17
Such a probability distribution ranges over parts of the nodes of multitrees. Even if
we have no basis for choosing among different tree structures, we can prefer multitrees
whose individual nodes have higher probability. Chiang (2005) generalized this idea
to bootstrap a synchronous grammar from a pre-existing phrase-to-phrase translation
model.

Research on hierarchical alignment has a rich history in the context of example-based
machine translation. To our knowledge, all the algorithms presented in that context
presume that parse trees are available for all multitext components, which is why
that subclass of alignment algorithms is usually called tree alignment (Meyers, Yangarber, and Grishman, 1996)
or structural matching (Matsumoto et al., 1993). The idea that alignment can be carried
out under much more varied conditions was first put forth by Wu (1995), and further
expounded by Wu (2000). In this section, we offer a more precise characterization of
the relationship between multiparsing and hierarchical alignment, by showing that
hierarchical alignment can be carried out using exactly the same logics, semirings, search
strategies, and termination conditions as ordinary multiparsing algorithms. A
generalization of what counts as a grammar is sufficient.

7.1 A Common Scenario 

For an extended example, we consider the common alignment scenario where a
lexicalized monolingual grammar is available for just one component. For example, many
multitexts have at least one component in one of the languages for which treebanks have
been built.^18 Given a treebank in the language of one of the input components, we can
induce a lexicalized PCFG. Alternatively, if a non-probabilistic parser is available for
one of the input components, then we can first parse that component, and then proceed
as we would from a treebank. Regardless of how we obtain it, a monolingual lexicalized
grammar can guide our search for the multitree with the most probable structure in the
resource-rich component. More generally,^19 we might have a lexicalized PGMTG in D
dimensions, from which we want to align I-dimensional multitrees, I > D. Without
loss of generality, we shall let the PGMTG range over the first D components. We shall
then refer to the D structured components and the I - D unstructured components.

Figure 8

Example of constraints in hierarchical alignment. Only one synchronous dependency structure
(dashed arcs) is compatible with the monolingual structure (solid arcs) and word alignment
(shaded squares).

16 In contrast, a parallel treebank (Han, Han, and Ko, 2001) might contain no information about
translational equivalence beyond sentence alignments.

17 Although most of the literature discusses word translation models between only two languages, it
is possible to use one language as a pivot to combine several 2D models into a higher-dimensional model
(Mann and Yarowsky, 2001).

18 At the time of writing, we are aware of treebanks for English, Spanish, French, German, Chinese, Czech,
Arabic, and Korean.

Given a one-to-one matching between the words in a multitext, choosing the opti- 
mal structure for one component is tantamount to choosing the optimal synchronous 
structure for all components. For example, in Figure 8, a monolingual grammar has
allowed only one synchronous dependency structure on the English side, and a word- 
to-word translation model has allowed only one word alignment. Ignoring the non- 
terminal labels, only one dependency structure is compatible with these constraints. 
Unmatched nodes in the structured component can be ignored. Unmatched nodes in 
the unstructured component can be heuristically attached either to the left or to the 
right (Wu, 1995), or even randomly. More generally, the given word matching need
not be one-to-one and the structure given for the structured component need not be a 
single tree or a tree at all. Missing substructures and other ambiguities in these input 
constraints can be resolved during the alignment process. 

To combine structural and translational constraints for alignment in this manner, 
it is convenient to suppose that we are inducing a bilexical PGMTG under the Viterbi- 
derivation semiring. Given a bilexical PCFG, or a functionally equivalent approxima- 
tion thereof, we can search for a multitree that simultaneously has a high-probability 
tree structure and a high-probability correspondence among words in its nodes. Such an 
inference process is, by definition, a generalized parser. It can be based on any parsing 
logic, including Logic C. If we have no I-PGMTG, then we can evaluate the grammar
terms in a way that does not rely on it. Let G() be the function that the grammar uses
to assign probabilities to production rules. Ordinarily, we have $G(LHS \Rightarrow RHS) =
\Pr(RHS \mid LHS)$. A modified definition is necessary in the typical alignment scenario
where the grammar has no estimates for $\Pr(RHS \mid LHS)$.

We begin with terminating productions. For the structured components, we retain
the usual definition. I.e., $G(X_d[h_d] \Rightarrow h_d) = \Pr(h_d \mid X_d[h_d])$, where the latter probability
can be looked up in a pre-existing D-PGMTG. For the unstructured components, there
are no useful nonterminal labels. Therefore, we assume that the unstructured components
use only one (dummy) nonterminal label $\Lambda$, so that $G(X_d[h_d] \Rightarrow h_d) = 1$ if $X_d = \Lambda$
and 0 otherwise.

Our treatment of nonterminating productions follows the standard approach of applying
the chain rule for conditional probabilities and then making independence assumptions
until all the terms are in a form that can be estimated from data. Readers
who are not interested in the details can skip ahead to Equation 29. According to the
chain rule,^20

$$G(X_1^I[h_1^I] \Rightarrow \bowtie[\pi_1^I](Y_1^I[g_1^I]\ Z_1^I[h_1^I])) = \Pr(\pi_1^I, g_1^I, Y_1^I, Z_1^I \mid X_1^I, h_1^I) \tag{20}$$

$$= \Pr(\pi_1^D, g_1^D, Y_1^D, Z_1^D \mid X_1^I, h_1^I) \tag{21}$$

$$\times \Pr(Y_{D+1}^I, Z_{D+1}^I \mid \pi_1^D, g_1^D, Y_1^D, Z_1^D, X_1^I, h_1^I) \tag{22}$$

$$\times \Pr(g_{D+1}^I \mid \pi_1^D, g_1^D, Y_1^I, Z_1^I, X_1^I, h_1^I) \tag{23}$$

$$\times \Pr(\pi_{D+1}^I \mid \pi_1^D, g_1^I, Y_1^I, Z_1^I, X_1^I, h_1^I) \tag{24}$$

19 Recall that PCFGs are a subclass of PGMTGs.

Our first independence assumption is that the structured components of the production's
RHS are conditionally independent of the unstructured components of its LHS:

$$\Pr(\pi_1^D, g_1^D, Y_1^D, Z_1^D \mid X_1^I, h_1^I) = \Pr(\pi_1^D, g_1^D, Y_1^D, Z_1^D \mid X_1^D, h_1^D) \tag{25}$$

The above probability can be looked up in the pre-existing D-PGMTG. Second, since we
have no useful nonterminals in the unstructured components, we let

$$\Pr(Y_{D+1}^I, Z_{D+1}^I \mid \pi_1^D, g_1^D, Y_1^D, Z_1^D, X_1^I, h_1^I) = \begin{cases} 1 & \text{if } Y_{D+1}^I = Z_{D+1}^I = \Lambda_{D+1}^I \\ 0 & \text{otherwise} \end{cases} \tag{26}$$

Third, we assume that the word-to-word translation probabilities are independent of
anything else:

$$\Pr(g_{D+1}^I \mid \pi_1^D, g_1^D, Y_1^I, Z_1^I, X_1^I, h_1^I) = \Pr(g_{D+1}^I \mid g_1^D) \tag{27}$$

In a typical alignment scenario, these probabilities would be obtained from a
word-to-word translation model, which would be estimated under such an independence
assumption. Finally, we assume that the output precedence arrays are independent of
each other and uniformly distributed, up to some maximum fan-out f. Let $\#(f)$ be the
number of unique precedence arrays of fan-out f or less. Then

$$\Pr(\pi_{D+1}^I \mid \pi_1^D, g_1^I, Y_1^I, Z_1^I, X_1^I, h_1^I) = \Pr(\pi_{D+1}^I) = \prod_{d=D+1}^{I} \frac{1}{\#(f)} = \frac{1}{\#(f)^{I-D}} \tag{28}$$

Under Assumptions 25-28,

$$G(X_1^I[h_1^I] \Rightarrow \bowtie[\pi_1^I](Y_1^I[g_1^I]\ Z_1^I[h_1^I])) = \frac{\Pr(\pi_1^D, g_1^D, Y_1^D, Z_1^D \mid X_1^D, h_1^D) \cdot \Pr(g_{D+1}^I \mid g_1^D)}{\#(f)^{I-D}} \tag{29}$$

if $Y_{D+1}^I = Z_{D+1}^I = \Lambda_{D+1}^I$, and 0 otherwise. The first term in the numerator comes from
a D-GMTG, and the second term from a conditional word-to-word translation model.
The denominator is a normalization constant.

20 The procedure is analogous when the head-child is the first nonterminal link on the RHS, rather than
the second. Information about which nonterminal link is the head-child can be encoded in the nonterminal
labels.

In the most common case that the multitext is just a bitext, and we have a structured
language model for just one of its components, the above equation boils down to

$$G(X_1^2[h_1^2] \Rightarrow \bowtie[\pi_1^2](Y_1^2[g_1^2]\ Z_1^2[h_1^2])) = \frac{\Pr(\pi_1, g_1, Y_1, Z_1 \mid X_1, h_1) \cdot \Pr(g_2 \mid g_1)}{\#(f)} \tag{30}$$

if $Y_2 = Z_2 = \Lambda$, and 0 otherwise. We can use these estimates in the inference rules of
Logic C, under any probabilistic semiring.
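
The bitext case can be sketched in a few lines of Python. This is a hedged illustration of the estimate described above, not the authors' implementation: the function name, its parameters, and the string used for the dummy label are assumptions, and the two probabilities are taken as given inputs where a real system would look them up in a monolingual grammar and a word-to-word translation model.

```python
DUMMY = "LAMBDA"  # stands in for the single dummy nonterminal label


def bitext_rule_estimate(p_structured, p_word, num_arrays, y2, z2):
    """Estimate G for a bitext production with one structured component.

    p_structured: probability of the structured part, from the monolingual grammar
    p_word:       word-to-word translation probability of the non-head child
    num_arrays:   number of unique precedence arrays of fan-out f or less
    y2, z2:       nonterminal labels in the unstructured component
    """
    # Only the dummy label is allowed in the unstructured component.
    if y2 != DUMMY or z2 != DUMMY:
        return 0.0
    return p_structured * p_word / num_arrays


print(bitext_rule_estimate(0.5, 0.25, 2, DUMMY, DUMMY))  # 0.0625
print(bitext_rule_estimate(0.5, 0.25, 2, "NP", DUMMY))   # 0.0
```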

More sophisticated methods of hierarchical alignment are certainly possible. For
example, we could project a part-of-speech tagger (Yarowsky, Ngai, and Wicentowski, 2001)
to improve our estimates in Equation 26. Or we could constrain each component with
its own monolingual parse tree (Smith and Smith, 2004). Yet, despite their relative
simplicity, the above methods for estimating production rule probabilities use all of the
available information in a consistent manner, without double-counting. Bootstrapping
a PGMTG from a lower-dimensional PGMTG and a word-to-word translation model is
similar in spirit to the way that regular grammars can help to induce CFGs (Lari and Young, 1990)
and the way that simple translation models can help to bootstrap more sophisticated
ones (Brown et al., 1993).

7.2 Word Alignment 

A degenerate subclass of hierarchical alignment algorithms is algorithms that produce 
flat structures, where every leaf is a child of the root. This subclass includes some algo- 
rithms for word alignment. A translation lexicon (weighted or not) can be viewed as a 
degenerate GMTG (not in GCNF) where every production has the form

$$\begin{pmatrix} S \\ \vdots \\ S \end{pmatrix} \Rightarrow \begin{pmatrix} t_1 \\ \vdots \\ t_D \end{pmatrix} \tag{31}$$

I.e., each production rewrites the start link into one terminal per component. Under
such a GMTG, the logic of word alignment is the one in Melamed (2003)'s Parser A.
However, instead of a single goal item, the goal of word alignment is any set of items
that covers the input exactly once. Also, since nonterminals do not appear on the RHS
of production rules, Compose inferences are impossible and unnecessary, so they can be
removed from the logic if desired.
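
The goal condition for word alignment, a set of items that covers the input exactly once, can be checked with a short Python sketch. The representation is an assumption made for illustration: each item is a tuple holding one 1-based word position per component, which presumes that every link has exactly one word in every component.

```python
from collections import Counter


def covers_exactly_once(items, lengths):
    """Check that `items` cover every word position of every component exactly once.

    `items` is a list of D-tuples of 1-based word positions, one per component;
    `lengths` gives the length of the input in each of the D components.
    """
    for d, n in enumerate(lengths):
        seen = Counter(item[d] for item in items)
        # Every position 1..n must occur once, and nothing else may occur.
        if seen != Counter(range(1, n + 1)):
            return False
    return True


print(covers_exactly_once([(1, 2), (2, 1)], (2, 2)))  # True
print(covers_exactly_once([(1, 1), (2, 1)], (2, 2)))  # False
```

The second set fails because position 1 of the second component is covered twice and position 2 is not covered at all.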

8. Parameter Estimation 

As for other probabilistic grammar formalisms, different parameter estimation methods 
are possible for PGMTGs. The traditional method for PCFGs is the Inside-Outside
algorithm (Baker, 1979), which performs unsupervised maximum likelihood estimation.
Here we present a generalization of the logic behind this algorithm to PGMTGs in 
GCNF. Our generalization can also be used to compute some common approximations 
to maximum likelihood. 

The Inside-Outside algorithm iterates over two stages. The first stage computes
inside and outside item values. The second stage aggregates and normalizes these values
to update the grammar. Goodman (1998) introduces the terms forward value and
reverse value as generalizations of "inside value" and "outside value", respectively,
for arbitrary semirings. The previous section described computation of forward values
in terms of a parsing logic, which is a generalization of Goodman (1998)'s bottom-up
logic for monolingual parsing. For computing reverse values, Goodman (1998) offers
an equation, which we re-express here in terms of forward values V() and reverse
values Z():

$$Z(Y_j) = \bigoplus_{X,\,Y_1,\ldots,Y_k} Z(X) \otimes \bigotimes_{i \ne j} V(Y_i) \tag{32}$$

such that X is the consequent of an inference whose antecedents are $Y_1, \ldots, Y_k$.

Goodman (1998, Section 5.1) stated that "we cannot compute the outside probability
of a nonterminal until we are finished computing all of the inside probabilities."
However, Equation 32 shows that, in general, it is possible to compute reverse values
before computing all forward values. The only values that are necessary for computing
the reverse value of an item are the reverse value of the item's parent and the forward
values of the item's siblings. Forward values of the item's parent or descendants are not
required. For example, it is possible to compute the reverse value of the NP in Figure 6
as soon as the forward value of the V is known, without having computed the forward
value of the S, the D, or the N.
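
The dependency structure just described can be sketched in Python. This is an illustrative reading of Equation 32 under stated assumptions: the function reverse_value, the string item names, the numeric values, and the choice to treat the grammar term as one of the antecedents are all hypothetical. Note that the table of forward values contains no entry for the NP, the S, the D, or the N.

```python
def reverse_value(target, inferences, Z, V, plus, times, zero):
    """Reverse value of `target`, following the shape of Equation 32.

    `inferences` is a list of (consequent, antecedents) pairs. For each
    inference where `target` is an antecedent, multiply the consequent's
    reverse value by the forward values of the other antecedents, then sum.
    """
    total = zero
    for consequent, antecedents in inferences:
        for j, antecedent in enumerate(antecedents):
            if antecedent != target:
                continue
            term = Z[consequent]
            for i, sibling in enumerate(antecedents):
                if i != j:
                    term = times(term, V[sibling])
            total = plus(total, term)
    return total


# Hypothetical values: S is inferred from NP, V and the grammar term "S -> NP V".
inferences = [("S", ("NP", "V", "S -> NP V"))]
Z = {"S": 1.0}                     # reverse value of the goal item
V = {"V": 0.5, "S -> NP V": 0.5}   # forward values of NP's siblings only
z_np = reverse_value("NP", inferences, Z, V,
                     plus=lambda a, b: a + b,
                     times=lambda a, b: a * b,
                     zero=0.0)
print(z_np)  # 0.25
```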

It is possible to elaborate the abstract parsing algorithm in Table 2 so that it
computes reverse values using Equation 32. E.g., it could compute them after all forward
values have been computed, as suggested by Goodman. It could also compute them
opportunistically, as soon as it knows the reverse value of the consequent X and the
forward values of all the antecedents $Y_1, \ldots, Y_k$, $i \ne j$. However, the question of when
to compute term values is a question of search strategy. In keeping with this article's
method of analysis, we abstract away the search strategy and specify only the
computational dependencies between item values. Parsing logics are the natural way to express
such dependencies. Table 6 shows Logic CR, which can compute both forward and
reverse term values. In addition to admitting a variety of search strategies, this logic
admits all the parsing semirings studied by Goodman. It can therefore work with the
unmodified abstract parsing algorithm in Table 2.

The main novelty of Logic CR is its treatment of "reverse" items as a first-class term 
type. The reverse items and reverse inference rules of Logic CR are defined so that
$Z(x) = V(x^R)$. Thus, instead of using Equation 32, the reverse value of an item is
computed by Equation 3 as the forward value of the corresponding reverse item. The benefit
of this treatment is that computations of reverse item values can be subject to the same 
kinds of optimization that are used to speed up computation of forward values,
including pruning and logic transformations, such as the one proposed by Melamed (2003).

Let us consider how Logic CR extends Logic C. It has two new term types for record- 
ing the reverse values of terminal and nonterminal items. Reverse terminal items are 
useful for at least two purposes. First, if the input is nondeterministic, such as a word 
lattice coming from the acoustic module of a speech recognizer, then reverse values can 
be useful for pruning the lattice. Second, it is straightforward to generalize Logic CR 
into a logic for translation, the same way that Logic C was generalized to Logic CT. 
Then, reverse values of terminals in the output dimensions could be used to prune and 
reorder items on the agenda, the same way that a target language model is used in 
WFST-based SMT. Interestingly, a reverse terminal item can involve any terminal in the 
grammar, which may or may not correspond to any word in the input. It is perfectly 
valid to compute reverse values for partial parses whose forward value remains at its 
initial default (e.g. probability zero). If such values are unnecessary for the application 
at hand, then their computation can be avoided using logic optimizations analogous to 
the macro inference rule in Section 6.2.2.

Logic CR also introduces a new kind of axiom called a pivot, which declares the 
reverse value of the item that spans the whole input and has the grammar's start symbol 
as its label. It is impossible to infer this value, because computation of an item's reverse 
value requires knowing its parent's reverse value, and an item spanning the whole input
cannot have a parent. Fortunately, it is unnecessary to compute this value, because
the reverse value of any item labeled with the grammar's start symbol is always the 
multiplicative identity of the semiring. 

Table 6
Logic CR: D is the dimensionality of the grammar and d ranges over dimensions; $n_d$ is the
length of the input in dimension d; $i_d$ ranges over word positions in dimension d,
$1 \le i_d \le n_d$; $w_{d,i_d}$ are input words; X, Y and Z are nonterminal symbols; t is a
terminal symbol; $\pi$ is a PAV; $\tau$, $\sigma$ and $\nu$ are d-spans.

Term Types
terminal items: $(d, i, t)$ and the corresponding reverse items
nonterminal items: $[X, \sigma]$ and the corresponding reverse items
terminating productions: $X \Rightarrow t$ for $1 \le d \le D$
nonterminating productions: (form not recoverable from the source)

[The rest of the table is missing from the source. The only surviving row labels are
"input words" and "grammar terms" (axioms) and "Scan" (inference rule).]

Logic CR has new rules for inferring the new item types. Two Reverse Compose
rules are required: one for the case where the consequent label comes first on the RHS
of the antecedent grammar term, and one for the case where it comes second. The
computations of the PAV in the antecedent and the d-span in the consequent of the
Reverse Compose rules involve two new operators, which perform the inverse operations
of $+$ and $\otimes$ over d-spans:

• $-$ is the subtraction operator: Given an ordered pair of d-spans $\nu$ and $\tau$, it
outputs the d-span $\sigma$ such that $\sigma + \tau = \nu$. The output is undefined if $\tau$
contains intervals not covered by $\nu$.

• $\oslash$ is the reverse relativization operator, defined by the equation
$\nu \oslash \tau = (\nu - \tau) \otimes \tau$.

Both operators apply componentwise to vectors of d-spans.
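
The behavior of $+$ and its inverse can be sketched in a few lines of Python. The
representation below is an assumption made for illustration: a d-span is a tuple of
half-open intervals over word boundaries, touching intervals are merged, and overlapping
operands are treated as an error. The function names are hypothetical, and the
relativization operators are not sketched because they depend on the definition of PAVs
given elsewhere in the article.

```python
def normalize(intervals):
    """Sort intervals and merge those that touch; reject overlapping intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            raise ValueError("d-spans overlap")
        if merged and start == merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return tuple(merged)


def plus(sigma, tau):
    """Combine two non-overlapping d-spans into one."""
    return normalize(list(sigma) + list(tau))


def positions(span):
    """The set of unit word positions covered by a d-span."""
    return {p for start, end in span for p in range(start, end)}


def minus(nu, tau):
    """Return sigma such that plus(sigma, tau) == nu; undefined if tau is not covered."""
    nu_pos, tau_pos = positions(nu), positions(tau)
    if not tau_pos <= nu_pos:
        raise ValueError("undefined: tau contains intervals not covered by nu")
    return normalize([(p, p + 1) for p in sorted(nu_pos - tau_pos)])


def minus_vec(nu_vec, tau_vec):
    """Apply minus componentwise to vectors of d-spans, one component per dimension."""
    return tuple(minus(nu, tau) for nu, tau in zip(nu_vec, tau_vec))


sigma = ((0, 2), (3, 5))   # a discontinuous d-span
tau = ((2, 3),)            # the gap in sigma
nu = plus(sigma, tau)      # the contiguous d-span covering positions 0 through 4
recovered = minus(nu, tau)
```

In this example `minus` undoes `plus`: subtracting `tau` from `nu` recovers the
discontinuous `sigma`.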

With an inside semiring, Logic CR can compute the inside and outside item values
required for a multidimensional Inside-Outside algorithm. This algorithm can be used
to re-estimate the parameters of a PGMTG in GCNF. Equations for aggregation and
normalization are necessary to complete the specification of the algorithm. Let $V^q([\cdot])$
be the value of item $[\cdot]$ on iteration q. Let $G^q(P)$ be the value assigned by the grammar to
production rule P on iteration q. Let T be the RHS of a terminating production rule. Let
$i$ be a vector of word positions and let $\mathbf{1}$ be a vector of 1's. Then the update
equations$^{21}$ are:



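[The first update equation, for terminating productions $X \Rightarrow T$, is not
recoverable from the source text.]

$$G^{q+1}\bigl(X \Rightarrow \bowtie[\tau \otimes \sigma](Y\ Z)\bigr) = \sum_{\tau,\sigma} V^q\bigl([X, \tau + \sigma]^{\mathrm{rev}}\bigr) \cdot V^q([Y, \tau]) \cdot V^q([Z, \sigma]) \cdot G^q\bigl(X \Rightarrow \bowtie[\tau \otimes \sigma](Y\ Z)\bigr)$$

Here $[X, \tau + \sigma]^{\mathrm{rev}}$ denotes the reverse item corresponding to $[X, \tau + \sigma]$.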

To aggregate over multiple training sentence tuples, augment word positions and span 
boundaries to record the sentence tuple number. 
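
For intuition, the one-dimensional special case of this re-estimation step can be
sketched in Python. The sketch assumes an ordinary PCFG in Chomsky normal form rather
than a PGMTG in GCNF, a single training sentence, and dictionary-based data structures
invented for this example. It normalizes the expected counts per left-hand side, which
is one common choice and is not necessarily the normalization intended in the article.

```python
from collections import defaultdict


def inside_outside_step(words, lexical, binary, start="S"):
    """One re-estimation step on one sentence.

    lexical maps (X, word) to a probability; binary maps (X, Y, Z) to a probability.
    """
    n = len(words)
    inside = defaultdict(float)   # forward values, keyed by (label, i, k)
    outside = defaultdict(float)  # reverse values, keyed by (label, i, k)

    # Forward pass, bottom-up.
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                inside[X, i, i + 1] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (X, Y, Z), p in binary.items():
                    inside[X, i, k] += p * inside[Y, i, j] * inside[Z, j, k]

    # Pivot: the reverse value of the root item is the multiplicative identity.
    outside[start, 0, n] = 1.0
    # Reverse pass, top-down: wider parents are finished before narrower children.
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (X, Y, Z), p in binary.items():
                    out = outside[X, i, k]
                    outside[Y, i, j] += out * p * inside[Z, j, k]
                    outside[Z, j, k] += out * p * inside[Y, i, j]

    # Unnormalized counts: reverse value of the parent, times forward values
    # of the children, times the current rule weight.
    bin_count = defaultdict(float)
    for (X, Y, Z), p in binary.items():
        for i in range(n):
            for k in range(i + 2, n + 1):
                for j in range(i + 1, k):
                    bin_count[X, Y, Z] += (
                        outside[X, i, k] * inside[Y, i, j] * inside[Z, j, k] * p
                    )
    lex_count = defaultdict(float)
    for (X, word), p in lexical.items():
        for i, w in enumerate(words):
            if w == word:
                lex_count[X, word] += outside[X, i, i + 1] * p

    # Normalize so that the rules sharing a left-hand side sum to one.
    totals = defaultdict(float)
    for rule, c in list(bin_count.items()) + list(lex_count.items()):
        totals[rule[0]] += c
    new_lex = {r: c / totals[r[0]] for r, c in lex_count.items() if totals[r[0]] > 0}
    new_bin = {r: c / totals[r[0]] for r, c in bin_count.items() if totals[r[0]] > 0}
    return new_lex, new_bin


new_lex, new_bin = inside_outside_step(
    ["a", "a", "a"], {("S", "a"): 0.5}, {("S", "S", "S"): 0.5}
)
```

Each of the two parses of the toy sentence uses two binary and three lexical
productions, so the re-estimated weights move toward 2/5 and 3/5. With several training
sentences, each sentence's counts would additionally have to be divided by that
sentence's total value before aggregation.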

Parsing under the inside semiring requires summing over all possible derivations of 
the training data, which precludes the efficiency mechanisms suggested in Section IS^H 
Given the computational expense of exhaustive multiparsing, cheaper approximations 
are often desirable. Instead of computing over all possible derivations, we can use only 
the n best derivations for some fixed maximum n. This approach was also suggested by 
Brown et al. (1993) for the more sophisticated of their translation models. Logic CR can
compute this approximation without modification if it is used with the Viterbi semiring
or its n-best generalization. The above update equations are appropriate regardless of
which of these semirings is used to compute the values $V()$. It is also possible to use a
variant of the above equations when $G()$ ranges over values in an expectation semiring
(Eisner, 2002). Such a variant, together with Logic CR, could compute the expected
feature counts necessary to re-estimate a maximum entropy synchronous grammar of
the kind used by Chiang (2005).

Our development of Logic CR was motivated by parameter estimation for PGMTGs.
Goodman (1998, Section 2.4) suggests several other applications of reverse semiring
values:

• pruning,

• defining non-standard criteria for parser performance, and then

• improving parser performance on those criteria, which can result in

• faster parsing, even without pruning.

In all these applications it is useful to know reverse item values before all the forward
values are known.

$^{21}$ We omit the dimension indexes to reduce clutter.

In addition to the usual "root" goal item, the termination conditions for a typical
application of Logic CR would involve the set of reverse terminal items that correspond to
the input words. Loosely speaking, Logic CR would aim to reach the root goal bottom-up,
and then return to the input that it started from top-down. We therefore refer to the
class of logics that infer both forward and reverse values as round-trip parsing logics.



9. Translation Evaluation by generalized parsing 



In recent years, it has become de rigueur to evaluate MT systems objectively, using
automated comparison with reference translations (Thompson, 1991; Brew and Thompson, 1994).
All the currently popular evaluation measures compute some form of string similarity.
It is not difficult to imagine how such measures can miscalculate. For example, suppose
that the reference translation is (R) below, and that two MT systems output the
translations (T1) and (T2).

(R) Pat asked Sandy on Friday about the man from Oslo. 

(T1) On Friday, Pat asked Sandy about the man from Oslo.

(T2) Pat from Oslo asked Sandy on Friday about the man. 

The sentences in this example are neither long nor complicated. Yet all of the currently
popular automatic evaluation methods would incorrectly assign a higher score
to (T2) than to (T1), because (T2) has a longer matching n-gram with (R). The problem
is that string similarity is only a crude approximation to conceptual similarity. Methods
that measure the grammaticality of translations independently of a reference translation
(e.g., Rajman and Hartley, 2001) are also incapable of making the desired distinctions:
(T1) and (T2) are equally grammatical.
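
The n-gram claim in this example can be checked mechanically. The Python sketch below
computes the length of the longest contiguous word sequence shared by a candidate and
the reference. The crude tokenizer (lowercasing and dropping punctuation) and the
function names are assumptions made for this illustration; the sketch is not an
implementation of any particular published evaluation measure.

```python
import re


def tokens(sentence):
    """Lowercase the sentence and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", sentence.lower())


def longest_matching_ngram(reference, candidate):
    """Length of the longest contiguous word sequence shared by the two sentences."""
    ref, cand = tokens(reference), tokens(candidate)
    best = 0
    # previous[j] is the length of the common run ending at the previous
    # reference word and at candidate word j (1-based).
    previous = [0] * (len(cand) + 1)
    for r in ref:
        current = [0] * (len(cand) + 1)
        for j, c in enumerate(cand, start=1):
            if r == c:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


R = "Pat asked Sandy on Friday about the man from Oslo."
T1 = "On Friday, Pat asked Sandy about the man from Oslo."
T2 = "Pat from Oslo asked Sandy on Friday about the man."

t1_match = longest_matching_ngram(R, T1)  # "about the man from oslo"
t2_match = longest_matching_ngram(R, T2)  # "asked sandy on friday about the man"
```

Under these assumptions the longest match for (T1) is 5 words and the longest match for
(T2) is 7 words, even though (T1) is the better paraphrase of (R).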

More sophisticated MT systems will require more sophisticated evaluation meth- 
ods. In order to correctly evaluate examples like the one above, an evaluation method 
needs a catalogue of the syntactic alternations that preserve the meaning of an utter- 
ance. Synchronous grammars offer a perspicuous way to describe such alternations. 
For example, the production 



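$$\begin{pmatrix} \mathrm{NP} \\ \mathrm{NP} \end{pmatrix} \Rightarrow \bowtie \begin{bmatrix} [1,2,3] \\ [1,3,2] \end{bmatrix} \begin{pmatrix} \mathrm{NN}\ \mathrm{PP}_1\ \mathrm{PP}_2 \\ \mathrm{NN}\ \mathrm{PP}_1\ \mathrm{PP}_2 \end{pmatrix} \qquad (35)$$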



could be included, to allow prepositional phrases modifying the same head to switch 
places. However, the relative order of determiners and adjectives in English noun 
phrases is strict, so the production 



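$$\begin{pmatrix} \mathrm{NP} \\ \mathrm{NP} \end{pmatrix} \Rightarrow \bowtie \begin{bmatrix} [1,2,3] \\ [2,1,3] \end{bmatrix} \begin{pmatrix} \mathrm{Det}\ \mathrm{Adj}\ \mathrm{NN} \\ \mathrm{Det}\ \mathrm{Adj}\ \mathrm{NN} \end{pmatrix} \qquad (36)$$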



would not be included. The grammar would also include productions that have identi- 
cal components in both dimensions. 

Given such a grammar G, a reference translation R, and an MT system output T,
a multiparser can attempt to find a multitree covering the bitext (R, T) under G. If
the parser succeeds, then, according to the grammar, T is a valid translation (actually,
paraphrase) of R. If the parser fails, then T is not an acceptable paraphrase of R, either
because it does not mean the same thing or because it is ungrammatical.

There are two practical problems with this approach. First, it is usually desirable to
obtain a numerical grade of translation quality, rather than just a boolean indicator of
acceptability. Second, it would probably be infeasible, or at least unreliable, to compile
all the valid syntactic alternations manually. One possible solution to these problems was
proposed by Leusch, Ueffing, and Ney (2003), who restricted themselves to a Bracketing
Transduction Grammar (Wu, 1995) with just one dummy nonterminal, partitioned
its possible production rules into seven classes, and manually assigned a cost for each
class.

An alternative approach is to estimate the required grammar empirically. The Linguistic
Data Consortium has recently published several "multiple-translations" corpora.
These are corpora containing multiple independent translations of a set of source
documents, aligned at the sentence level. Each set of independent translations can
be viewed as mutual paraphrases (Pang, Knight, and Marcu, 2003). We can estimate
a monolingual PGMTG$^{22}$ from these sets of parallel sentences using exactly the same
algorithms that we use to estimate multilingual PGMTG, as described in Sections 7 and 8.
Using such a PGMTG, a probabilistic multiparser can return the probability that a translation
is valid with respect to a reference. Different translations and the MT systems that
output them can be compared on these scores.

MT evaluation by means of a monolingual PGMTG has two advantages over string-based
methods (Melamed, 1995; Papineni et al., 2002; Melamed, Green, and Turian, 2003).
First, this method can be sensitive to meaning-preserving syntactic alternations. To the
extent that human judges use such information in evaluating MT outputs, an automatic
evaluation method that uses such information might do a better job of predicting human
judgments. Second, the method itself can be objectively evaluated in terms of its
model's ability to predict held-out data. Such meta-evaluation can be performed without
expensive and unreliable human judgments.

A temporary disadvantage of this approach is that research on multitext modeling 
has not begun yet. The problem of inducing a PGMTG can be approached from the 
perspective of bilingual language modeling (Wu, 1997), with at least all the methods and
challenges of monolingual language modeling. Estimation of a monolingual PGMTG 
would be hampered by the relatively small size of suitable training data. On the other 
hand, it is easier to estimate a translation model from a given language to itself than 
to other languages, if only because the identity relation provides an excellent word-to- 
word translation model as a starting point. 

When good multitext models become available, generalized parsers will become 
the engine driving yet another important part of the standard SMT architecture. 

10. Putting it all together 

Figure 5 shows the data-flow diagram for a rudimentary SMT system that is driven
by tree-structured translation models. All the generalized parsing algorithms involved 
can be implemented as different parameterizations of the abstract parsing algorithm in 
Section lZSl Below is a sample recipe for running a system of this kind through training, 
application to new inputs, and evaluation. Unless stated otherwise, each generalized
parser's goal is an item that spans the input and is labeled with the start symbol of
the grammar. At runtime, the abstract parser's termination conditions would typically
involve goal items as well as limits on time and/or memory consumption. Any search
strategy can be used, at least in theory. In practice, we must manage computational
complexity, so best-first search is a common favorite.

Figure 5
Data-flow diagram for a rudimentary system for SMT by parsing. Boxes are data; ovals are
processes; arcs are flows; dashed flows and data are recommended but optional.

$^{22}$ I.e., a PGMTG that generates the same language in all components.

T1. Induce a word-to-word translation model. Use Logic A (Melamed, 2003) with
enough goal items spanning one word from each component to cover the input.
Alternatively, publicly available WFST-based tools can be used (Och and Ney, 2003).

T2. Induce PCFG(s) from monolingual treebank(s), e.g. by computing the relative
frequencies of productions.

T3. Hierarchically align the training multitext, e.g. using Logic C and the derivation-forest
semiring. Constraining PCFG(s) and a word-to-word translation model
can be used to imitate a PGMTG, as described in Section 7. Other approximations
can be used if these knowledge sources are not available or if other
relevant knowledge sources are available.

T4. Induce an initial PGMTG from the multitreebank, e.g. by computing the relative
frequencies of productions.

T5. Re-estimate the PGMTG parameters using Logic CR, starting with the initial
PGMTG. Ideally, use the inside semiring, but if that's too expensive, then use
Viterbi-n-best. In addition to the usual goal item, the termination condition
involves the reverse item corresponding to each of the input axioms.

T1'-T5' Same as T1-T5, but starting with monolingual multitext. The identity relation
can be used for T1' as a short-cut.

A1. Use the PGMTG to infer the most probable multitree covering the input multitext.
Use Logic CT under the Viterbi-derivation semiring. If a target language
model is available, use Logic CTM (Melamed, 2004a).

A2. Linearize the output yield of the multitree.

E1. For each component of the test output, multiparse the bitext consisting of this
component and the corresponding reference translation, using Logic C under
the inside semiring and the monolingual PGMTG.
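
The relative-frequency estimation mentioned in steps T2 and T4 can be sketched as
follows for the ordinary one-dimensional case. The nested-tuple tree encoding, the
function names, and the two-tree treebank are assumptions made for this example.
Estimating a PGMTG from a multitreebank would count multi-component productions in the
same way.

```python
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs) for every internal node of a tree given as nested tuples."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def relative_frequencies(treebank):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}


# A hypothetical two-tree treebank; leaves are plain strings.
treebank = [
    ("S", ("NP", "Pat"), ("VP", ("V", "asked"), ("NP", "Sandy"))),
    ("S", ("NP", "Sandy"), ("VP", ("V", "left"))),
]
pcfg = relative_frequencies(treebank)
```

Here `pcfg[("NP", ("Sandy",))]` is 2/3 because two of the three NP nodes in the toy
treebank rewrite as "Sandy".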

A variety of algorithms have been proposed for Process T1 (Melamed, 2000; Och and Ney, 2003),
and some of them are available as free software. Processes T2, T4, T2', T4', and A2 are
trivial. Processes T3, T5, T3', T5', A1, and E1 are the generalizations of parsing and
their applications presented in this article. The "Statistical Machine Translation by Parsing"
team at the 2005 JHU Language Engineering Workshop used this recipe to build
GenPar, the first publicly available system of this type (Burbank et al., 2005). GenPar
revolves around a single abstract parser.
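
To make the five functional parameters concrete, the following Python sketch shows a
small best-first parser for the ordinary one-dimensional case. It is an illustration
written for this passage, not GenPar's implementation. The assumptions are: a PCFG in
Chomsky normal form stored in dictionaries (the grammar); hard-coded axioms and a
Compose rule (the logic); the Viterbi semiring with probabilities no greater than one;
a priority queue as the best-first search strategy; and termination as soon as the goal
item is finished or the agenda is empty.

```python
import heapq


def best_first_parse(words, lexical, binary, start="S"):
    """Return the Viterbi value of the goal item, or 0.0 if it cannot be inferred.

    lexical maps (X, word) to a probability; binary maps (X, Y, Z) to a probability.
    """
    n = len(words)
    goal = (start, 0, n)
    chart = {}    # finished items and their best values
    agenda = []   # max-heap simulated by negating values
    # Axioms: one item per input word and matching terminating production.
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                heapq.heappush(agenda, (-p, (X, i, i + 1)))
    while agenda:
        neg_value, item = heapq.heappop(agenda)
        if item in chart:
            continue                  # a copy with an equal or better value is done
        value = -neg_value
        chart[item] = value
        if item == goal:              # termination condition
            return value
        label, i, k = item
        # Compose the newly finished item with every adjacent finished item.
        for (X, Y, Z), p in binary.items():
            for (other, a, b), v in list(chart.items()):
                if Y == label and Z == other and a == k:
                    heapq.heappush(agenda, (-(p * value * v), (X, i, b)))
                if Z == label and Y == other and b == i:
                    heapq.heappush(agenda, (-(p * v * value), (X, a, k)))
    return 0.0


best = best_first_parse(["a", "a", "a"], {("S", "a"): 0.5}, {("S", "S", "S"): 0.5})
no_parse = best_first_parse(["b"], {("S", "a"): 0.5}, {("S", "S", "S"): 0.5})
```

Because every composition multiplies by values no greater than one, the first time an
item is popped from the agenda its value is already the best one, so the parser can stop
as soon as the goal item is popped. Swapping in a different semiring or search strategy
would generally require a different rule for when an item's value counts as final.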

11. Summary and Outlook 

This article has extended the theory of semiring parsing to present a new analysis of 
many common parsing algorithms, as well as other algorithms that are not usually con- 
sidered parsing algorithms. The analysis revealed that all of these algorithms can be 
implemented by an abstract parsing algorithm with five functional parameters: a gram- 
mar, a logic, a semiring, a search strategy, and a termination condition. The article then 
varied two of these functional parameters — the logic and the grammar — to arrive at 
the class of translators and the class of hierarchical aligners. In this manner, the article
has elucidated the relationships between ordinary parsing and these other classes of al- 
gorithms more precisely than previously possible. The article then presented two new 
applications of generalized parsing, and showed how the various generalizations and 
their applications can be used to do all the heavy lifting in a rudimentary system for 
statistical machine translation by generalized parsing. 

There are distinct advantages to building SMT systems in this manner. The software 
engineering advantage is that improvements invented for one of these algorithms can 
often be applied to all of them. For example, Melamed (2003) showed how to reduce the
computational complexity of a multiparser by a factor of n^, just by changing the logic. 
The same optimization can be applied to any generalized parser based on Logics C, 
CT, or CR. With good software design, such optimizations need never be implemented 
more than once. Researchers who adopt this approach can concentrate their talents on 
better models, without worrying about system-specific "decoders." 

A more important advantage in the long term is that this approach to building MT 
systems encourages MT research to be less specialized and more transparently related 
to the rest of computational linguistics. A well-understood connection between parsing 
and SMT algorithms can foster a stronger connection between research in SMT and 
research in the rest of computational linguistics, a connection that has been weakening 
in recent years to the detriment of both research communities. Research on SMT by 
Parsing can build on past and future research on ordinary parsing. Stronger connections 
between the two research communities would enable more researchers to contribute 
to MT research, accelerating progress. Conversely, we expect generalized parsers to 
be useful for other problems with a similar structure, such as sentence compression 
(Knight and Marcu, 2000) and structured generation (Langkilde, 2000).

The viability of statistical machine translation by generalized parsing will hinge on 
development of more powerful logics and grammar formalisms than the simplistic ex- 
amples used in this article. Improved machine learning methods will also be critical. 
We conjecture that the best SMT systems of the near future will combine new learning 
algorithms with the expressive power of tree-structured translation models. However, 
inference is likely to remain the main source of complexity, both conceptual and com- 
putational, in these new learning algorithms. As better parameters are invented for the 
abstract parsing algorithm, we expect the abstractions presented in this article to be- 
come even more important in reducing the complexity of statistical machine translation 
systems. 